Chapter 1
INTRODUCTION 


Advances in computer architecture, in both hardware and software
structures, now provide the technology to recast traditional sequential
solutions as high-performance, low-cost parallel systems. New processes
and methodologies influence the implementation of real-time systems
such as guidance, control, avionics, and robotics. The increasing demand
for faster real-time computation can be met with parallel approaches at
the algorithmic as well as the hardware level.

This report is the result of research conducted on the development of a
specialized computer architecture for the real-time algorithmic
execution of an avionics guidance and control problem. The
objective of this research is to enhance vehicle guidance resulting
from optimal guidance strategies by incorporating high-speed parallel
processing. This report presents a comprehensive treatment of both the
hardware and software structures of a customized computer which performs
real-time computation of guidance commands with updated estimates of
target motion and time-to-go.

Optimal control strategies are available for use in many guidance 
problems. In this research, the performance index for optimal guidance 
is chosen as a quadratic function of terminal miss and control action 
costs. A set of coupled, first-order differential equations is solved
to compute the optimal commanded acceleration of the advanced space
vehicle.

To exploit a high degree of concurrency, the sequential algorithm is
restructured. A parallel, multi-step, predictor-corrector numerical
integration technique is employed to solve the set of differential 
equations. A subsequent effort is devoted to segmenting or decomposing 
the algorithm into parallel and concurrent processes. Evaluation of 
derivatives and the integration of state variables are partitioned as 
distinct tasks. A task graph is constructed by considering the sequence 
in which the tasks are to be executed, satisfying all the precedence 
constraints.

An important aspect of this research is the development of an 
optimum, real-time allocation algorithm which maps the algorithmic tasks 
onto the processing elements. This allocation is based on the critical 
path analysis, a widely used technique in graph theory and operations 
research. For the particular task graph considered, an optimal
allocation has been obtained with 29 processing elements. This enables
the execution of the graph in the minimum possible time, as dictated by
the precedence constraints of the graph and availability of resources.

The final stage is the design and development of the hardware 
structures suitable for the efficient execution of the allocated task 
graph. The system is data-driven, i.e., when the necessary operands for
a task arrive at a particular processing element, the task is
immediately executed. The basic system architecture consists of two
star-shaped clusters, each consisting of 64 processing elements. A
high-speed, buffered, crossbar, delta network allows parallel
communication between pairs of processing elements within a cluster.

The processing element is designed for rapid execution of the
allocated tasks. It contains local storage for both instructions and 
operands and extensive fault tolerance capabilities. Fault tolerance is 
a key feature of the overall architecture. 

The remaining chapters of this report consider the various aspects 
of the research in complete detail. In the second chapter, the guidance 
problem is completely examined and mathematically defined. Chapter 3 
deals with the restructuring of the sequential algorithm. Parallel 
numerical integration techniques, task definitions, and allocation 
algorithms are discussed in the third chapter. In Chapter 4 the 
parallel implementation is analytically verified and the experimental 
results are presented. Chapter 5 discusses the design of the data- 
driven computer architecture, customized for the execution of the 
particular algorithm. Some conclusions and recommendations are made in 
the last chapter. 

Chapter 2

THE GUIDANCE AND CONTROL PROBLEM 
2.1 BACKGROUND AND OBJECTIVE 

The objective of this research is to investigate the enhancement of 
vehicle guidance resulting from the incorporation of optimal guidance 
strategies made possible by high speed parallel processing in guidance 
computation. The critical objective of this study is the determination 
of realistic cycle periods for repetitive, real-time computation of 
guidance commands with updated estimates of target motion and time- 
to-go. 

For some time maneuvering vehicles have employed some form of 
proportional guidance, which is optimal, or near optimal in many 
engagements. In other conditions, its performance may be acceptable but 
less than perfect. Due mainly to recent advancements in microprocessor 
technology, more sophisticated techniques of advanced estimation and 
control theory may be implemented in a relatively small and inexpensive 
avionics package. 

A number of studies have been directed toward the application of 
optimal control theory and estimation to a related guidance area [1-10].
Simulation results have indicated improved performance subject to the 
suitability of the performance criteria, the critical estimates of 
time-to-go and target acceleration, and the sensitivity of the guidance 
law to the assumed missile dynamics. 

In at least one case, optimal guidance strategies were used in a
highly accurate, nonlinear, six-degree-of-freedom simulation of a
tactical missile [5-8]. Guidance gains were computed as a function of
time-to-go and constant target acceleration for a second-order, linear
model system and used in the nonlinear simulation with various estimation
procedures for time-to-go and target acceleration. The results were
promising, but it was difficult to accurately assess the contribution of 
model error, estimation error, or nonjudicious performance indices to 
the miss distance. 

Extremely rapid computation of optimal guidance algorithms on specially
designed parallel computer architectures, together with improved
estimates of target acceleration, opens the prospect of solving the
equations for the guidance gains repetitively during the course of the
intercept, using more accurate dynamic models with adjustable parameter
values and fresh estimates of target motion.


2.2 DEVELOPMENT

This section examines minimum control cost, minimum terminal miss 
guidance for the intercept of a moving target by a vehicle with inherent 
airframe and control system dynamic properties. Particular attention is
given to the idealized problem of zero terminal miss, wherein the
control gains are given in terms of the state transition of the
uncoupled airframe dynamics. This approach separates the kinematic
portion of the intercept dynamics, which is common for all intercept
problems, from the kinetic portion of the vehicle dynamics. The
particular problem structure makes the results applicable to a variety 
of engagement problems in which the vehicle airframe can be represented 
by a linear model. 

The resulting control law has a term related to the intercept 
kinematics, which is recognizable as the familiar generalized 
proportional navigation term. A second term in the control law is a 
linear function of the vehicle airframe state and represents the
guidance compensation due to finite airframe dynamics. The final term
in the guidance law is related to target motion, providing an effective 
control in cases where target motion can be measured or predicted 
accurately. 

The formulation structures the guidance problem for separation of 
the intercept kinematics from the dynamics of the vehicle. The motion
of the target is accepted as an uncontrollable input to the problem; 
however, the kinetic state equation can be augmented with a target model 
if available. 


2.2.1 Kinematics 

Let the vector position and velocity of the target (T) and 
controlled vehicle (I) be represented in an inertial frame by y and v, 
as illustrated in Figure 1. 


    \dot{y}_I = v_I,        \dot{y}_T = v_T
    \dot{v}_I = a_I,        \dot{v}_T = a_T                            (1)

Defining the relative position and velocity of the target with
respect to the missile, y = y_T - y_I and v = v_T - v_I, yields

    \dot{y} = v
    \dot{v} = a_T - a_I                                                (2)

Letting the vector x = [y^T  v^T]^T represent the kinematic state of
the intercept, then

    \dot{x} = A x + B (a_I - a_T)                                      (3)

where

    A = \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix},
    B = -\begin{bmatrix} 0 \\ I \end{bmatrix}.


In (3) the identity and null submatrices reflect the dimensionality of 
the problem. 
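
As a minimal sketch of the kinematic model, the following Python fragment builds A and B for a chosen spatial dimension and evaluates the state derivative. It assumes the block forms A = [[0, I], [0, 0]] and B = -[0; I], which are the forms consistent with \dot{v} = a_T - a_I; the function names `kinematic_matrices` and `state_derivative` are illustrative and do not come from the report.

```python
def kinematic_matrices(dim):
    """Return (A, B) for the state x = [y; v], with y and v of length dim."""
    n = 2 * dim
    A = [[0.0] * n for _ in range(n)]
    B = [[0.0] * dim for _ in range(n)]
    for i in range(dim):
        A[i][dim + i] = 1.0    # upper-right identity block: y_dot = v
        B[dim + i][i] = -1.0   # lower block of B is -I: v_dot = -(a_I - a_T)
    return A, B


def state_derivative(A, B, x, a_I, a_T):
    """Evaluate x_dot = A x + B (a_I - a_T) with plain Python lists."""
    rel = [ai - at for ai, at in zip(a_I, a_T)]
    return [
        sum(A[i][j] * x[j] for j in range(len(x)))
        + sum(B[i][k] * rel[k] for k in range(len(rel)))
        for i in range(len(x))
    ]


# One-dimensional example: y = 10, v = -2, a_I = 3, a_T = 1
A, B = kinematic_matrices(1)
x_dot = state_derivative(A, B, [10.0, -2.0], [3.0], [1.0])
```

In the one-dimensional example the position rate equals the relative velocity and the velocity rate equals a_T - a_I.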

2.2.2 Kinetics 

The airframe/control response state is designated as z and satisfies 
the linear equation 

    \dot{z} = Dx + Ez + Fu                                             (4)

subject to the airframe control u (thrust, control surface deflection,
etc.). The dimensions and components of the coefficient matrices in (4)
are vehicle-dependent and provide the generality in the problem.

The intercept kinematics are coupled to the airframe dynamics by

    a_I = Gz + Hu                                                      (5)

Thus any linearized airframe describable by (4) and (5) is subject to 
the analysis. 

2.2.3 Optimal Guidance

The conventional guidance performance index for targeting vehicles
is of the form

    J = 0.5 x(t_f)^T S_f x(t_f) + \int_{t_0}^{t_f} (0.5 u^T R u + \Gamma) dt      (6)

where S_f, R, and \Gamma weigh the costs associated with terminal miss,
control cost, and time, respectively. In those cases where the terminal
miss is the significant cost, it is logical to constrain the final
position y(t_f) to zero and to develop the corresponding control law
under this condition. Equation (6) may be replaced by

    J = \int_{t_0}^{t_f} (0.5 u^T R u + \Gamma) dt                                (7)

and

    T x(t_f) = 0

where

    T = [ I  0 ]

The equations (3-7) are collected below.

Minimum Miss

    \dot{x} = Ax + BGz + BHu - B a_T
    \dot{z} = Dx + Ez + Fu
    J = 0.5 x(t_f)^T S_f x(t_f) + \int_{t_0}^{t_f} (0.5 u^T R u + \Gamma) dt      (8)

Zero Miss

    \dot{x} = Ax + BGz + BHu - B a_T
    \dot{z} = Dx + Ez + Fu
    T x(t_f) = 0
    J = \int_{t_0}^{t_f} (0.5 u^T R u + \Gamma) dt                                (9)


2.3 MINIMUM TERMINAL MISS 

If the performance index in (8) is augmented in the usual manner, 
the Hamiltonian for fixed terminal time is 

    H = 0.5 u^T R u + \lambda^T [Ax + BGz + BHu - B a_T] + \mu^T [Dx + Ez + Fu]   (10)

The resulting boundary value problem is

    \dot{x} = Ax + BGz + BHu - B a_T,            x(t_0) = x_0
    \dot{z} = Dx + Ez + Fu,                      z(t_0) = z_0
    \dot{\lambda} = -A^T \lambda - D^T \mu,      \lambda(t_f) = S_f x(t_f)        (11)
    \dot{\mu} = -G^T B^T \lambda - E^T \mu,      \mu(t_f) = 0
    u = -R^{-1} (H^T B^T \lambda + F^T \mu)

The computation of the control gains is achieved via the inverse
formulation

    x = Q_1 \lambda + Q_2 \mu + Q_3
    z = Q_4 \lambda + Q_5 \mu + Q_6

leading to the equations

    \dot{Q}_1 = A Q_1 + Q_1 A^T + Q_2 G^T B^T + B G Q_2^T - B H R^{-1} H^T B^T,   Q_1(t_f) = S_f^{-1}
    \dot{Q}_2 = A Q_2 + Q_2 E^T + Q_1 D^T + B G Q_5 - B H R^{-1} F^T,             Q_2(t_f) = 0
    \dot{Q}_3 = A Q_3 + B G Q_6 - B a_T,                                          Q_3(t_f) = 0
    Q_4 = Q_2^T,                                                                  Q_4(t_f) = 0
    \dot{Q}_5 = E Q_5 + Q_5 E^T + D Q_2 + Q_2^T D^T - F R^{-1} F^T,               Q_5(t_f) = S_f^{*-1}
    \dot{Q}_6 = E Q_6 + D Q_3,                                                    Q_6(t_f) = 0



The nonsingular diagonal matrix Sf* is used for the computation of 
the inverse problem and its elements are set to zero in the solution for 
the gains. 

The resulting control vector u is 
    u = K_1 x + K_2 z + K_3

where

    K_1 = -R^{-1} (H^T B^T P_1 + F^T P_2^T)
    K_2 = -R^{-1} (H^T B^T P_2 + F^T P_5)                                         (13)
    K_3 = -R^{-1} (H^T B^T P_3 + F^T P_6)

and

    P_1 = [Q_1 - Q_2 Q_5^{-1} Q_2^T]^{-1}
    P_2 = -[Q_1 - Q_2 Q_5^{-1} Q_2^T]^{-1} Q_2 Q_5^{-1}
    P_5 = [Q_5 - Q_2^T Q_1^{-1} Q_2]^{-1}                                         (14)
    P_3 = -P_1 Q_3 - P_2 Q_6
    P_6 = -P_2^T Q_3 - P_5 Q_6
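
The expressions for P_1, P_2, and P_5 in (14) can be read as the blocks of the inverse of the symmetric matrix [[Q_1, Q_2], [Q_2^T, Q_5]]. The short Python check below illustrates this reading for the scalar case only; the function name `gain_blocks_scalar` and the numerical values are illustrative assumptions, not taken from the report.

```python
def gain_blocks_scalar(q1, q2, q5):
    """Scalar version of (14): blocks of the inverse of [[q1, q2], [q2, q5]]."""
    p1 = 1.0 / (q1 - q2 * q2 / q5)   # P_1 = [Q_1 - Q_2 Q_5^{-1} Q_2^T]^{-1}
    p2 = -p1 * q2 / q5               # P_2 = -P_1 Q_2 Q_5^{-1}
    p5 = 1.0 / (q5 - q2 * q2 / q1)   # P_5 = [Q_5 - Q_2^T Q_1^{-1} Q_2]^{-1}
    return p1, p2, p5


# Illustrative values; the product [[p1, p2], [p2, p5]] * [[q1, q2], [q2, q5]]
# should be the 2x2 identity matrix.
q1, q2, q5 = 4.0, 1.0, 2.0
p1, p2, p5 = gain_blocks_scalar(q1, q2, q5)
product = [
    [p1 * q1 + p2 * q2, p1 * q2 + p2 * q5],
    [p2 * q1 + p5 * q2, p2 * q2 + p5 * q5],
]
```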

The particular case of interest is that in which the kinetics of the
vehicle are independent of the intercept position and velocity, i.e.,

    D = 0

In this case, the equations (11) are integrable and yield

    Q_5 = \int_t^{t_f} \Phi_E(t-\tau) F R^{-1} F^T \Phi_E^T(t-\tau) d\tau
          + \Phi_E(t-t_f) S_f^{*-1} \Phi_E^T(t-t_f)

    Q_2 = \int_t^{t_f} \Phi_A(t-\tau) B [H R^{-1} F^T - G Q_5(\tau)] \Phi_E^T(t-\tau) d\tau

    Q_1 = \int_t^{t_f} \Phi_A(t-\tau) [B H R^{-1} H^T B^T - B G Q_2^T(\tau) - Q_2(\tau) G^T B^T] \Phi_A^T(t-\tau) d\tau
          + \Phi_A(t-t_f) S_f^{-1} \Phi_A^T(t-t_f)                                (15)

    Q_6 = 0

    Q_3 = \int_t^{t_f} \Phi_A(t-\tau) B a_T(\tau) d\tau

where

    \dot{\Phi}_A = A \Phi_A,        \Phi_A(0) = I

and

    \dot{\Phi}_E = E \Phi_E,        \Phi_E(0) = I

The elements of S_f^* are set to zero in the resulting expressions for the
elements of Q.

The minimum terminal miss control law is determined by (13)-(15).
The case where the terminal miss weighting is large may be solved by
constraining the terminal position, as is done in the next section.


2.4 ZERO TERMINAL MISS 

The zero terminal miss, minimum control cost problem formulated in an
earlier section and outlined in (9) is solved here. This problem is
also treated in Reference 4. The augmented performance index for the
fixed-time problem,

    J = \nu^T T x(t_f) + \int_{t_0}^{t_f} [0.5 u^T R u + \lambda^T (Ax + BGz + BHu - B a_T) + \mu^T (Dx + Ez + Fu)] dt

yields the boundary value problem described by

    \dot{x} = Ax + BGz + BHu - B a_T,            x(t_0) = x_0
    \dot{z} = Dx + Ez + Fu,                      z(t_0) = z_0
    \dot{\lambda} = -A^T \lambda - D^T \mu,      \lambda(t_f) = T^T \nu           (16)
    \dot{\mu} = -G^T B^T \lambda - E^T \mu,      \mu(t_f) = 0


Selecting

    \lambda = S_1 x + S_2 z + S_3 \nu + S_4
    \mu     = S_5 x + S_6 z + S_7 \nu + S_8

(16) becomes

    \dot{S}_1 + A^T S_1 + S_1 A_2 + D^T S_5 + S_2 D_2 = 0,          S_1(t_f) = 0
    \dot{S}_2 + A^T S_2 + S_2 E_2 + D^T S_6 + S_1 B_2 = 0,          S_2(t_f) = 0
    \dot{S}_3 + A^T S_3 + D^T S_7 + S_1 C_2 + S_2 F_2 = 0,          S_3(t_f) = T^T
    \dot{S}_4 + A^T S_4 + S_2 H_2 + S_1 G_2 + D^T S_8 = 0,          S_4(t_f) = 0
    \dot{S}_5 + E^T S_5 + S_5 A_2 + G^T B^T S_1 + S_6 D_2 = 0,      S_5(t_f) = 0
    \dot{S}_6 + E^T S_6 + S_6 E_2 + G^T B^T S_2 + S_5 B_2 = 0,      S_6(t_f) = 0
    \dot{S}_7 + E^T S_7 + G^T B^T S_3 + S_5 C_2 + S_6 F_2 = 0,      S_7(t_f) = 0
    \dot{S}_8 + E^T S_8 + G^T B^T S_4 + S_5 G_2 + S_6 H_2 = 0,      S_8(t_f) = 0



where

    A_2 = A - B H R^{-1} (H^T B^T S_1 + F^T S_5)
    B_2 = BG - B H R^{-1} (H^T B^T S_2 + F^T S_6)
    C_2 = -B H R^{-1} (H^T B^T S_3 + F^T S_7)
    D_2 = D - F R^{-1} (H^T B^T S_1 + F^T S_5)
    E_2 = E - F R^{-1} (H^T B^T S_2 + F^T S_6)
    F_2 = -F R^{-1} (H^T B^T S_3 + F^T S_7)
    G_2 = -B H R^{-1} (H^T B^T S_4 + F^T S_8) - B a_T
    H_2 = -F R^{-1} (H^T B^T S_4 + F^T S_8)


Inspection of (17) and (18) shows that all unknown matrices except
S_3 and S_7 are null, leaving

    \dot{S}_3 + A^T S_3 + D^T S_7 = 0,           S_3(t_f) = T^T
    \dot{S}_7 + E^T S_7 + G^T B^T S_3 = 0,       S_7(t_f) = 0                    (19)
    u = -R^{-1} (H^T B^T S_3 + F^T S_7) \nu

The invariance of the terminal manifold

    \psi(t) = S_9 x + S_{10} z + S_{11} \nu + S_{12}

implies

    \dot{S}_9 + S_9 A + S_{10} D = 0,                                             S_9(t_f) = T
    \dot{S}_{10} + S_{10} E + S_9 B G = 0,                                        S_{10}(t_f) = 0
    \dot{S}_{11} - (S_9 B H + S_{10} F) R^{-1} (H^T B^T S_3 + F^T S_7) = 0,       S_{11}(t_f) = 0     (20)
    \dot{S}_{12} - S_9 B a_T = 0,                                                 S_{12}(t_f) = 0

where

    \nu = -S_{11}^{-1} [S_9 x + S_{10} z + S_{12}]

Therefore the control law for zero miss is

    u = K_1 x + K_2 z + K_3

where the navigation, vehicle airframe, and target position components
of the position and velocity control are evident.

    K_1 = R^{-1} (H^T B^T S_3 + F^T S_7) S_{11}^{-1} S_9
    K_2 = R^{-1} (H^T B^T S_3 + F^T S_7) S_{11}^{-1} S_{10}
    K_3 = R^{-1} (H^T B^T S_3 + F^T S_7) S_{11}^{-1} S_{12}

For the uncoupled case where D is null, the equations (19) and (20)
are integrated for computation of the optimal gains. Integration yields

    S_3 = S_9^T = \begin{bmatrix} I \\ (t_f - t) I \end{bmatrix}

    S_{10} = S_7^T = \int_t^{t_f} S_9(\tau) B G \Phi_E^*(t-\tau) d\tau

    S_{11}(t) = -\int_t^{t_f} [S_{10}(\tau) F + S_9(\tau) B H] R^{-1} [S_{10}(\tau) F + S_9(\tau) B H]^T d\tau     (21)

    S_{12}(t) = -\int_t^{t_f} S_9(\tau) B a_T(\tau) d\tau

where

    \dot{\Phi}_E^* = -\Phi_E^* E,        \Phi_E^*(0) = I

The target motion term of the control for the uncoupled case can be
written

    K_3 = K_0 \int_t^{t_f} (t_f - \tau) a_T(\tau) d\tau

where

    K_0 = R^{-1} (H^T B^T S_3 + F^T S_7) S_{11}^{-1}

Also

    K_3 = K_0 (t_f - t) [V_A(t) - V_T(t)]

where V_T(t) represents present target velocity and V_A(t) is the average
relative velocity on the terminal interval (t, t_f). The need for
effective estimation of target motion is recognized.
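
The equivalence of the two forms of the target motion term follows from integrating (t_f - \tau) a_T(\tau) by parts. The sketch below checks it numerically for one assumed target acceleration profile. It treats V_A as the average of the target velocity over (t, t_f), which is an interpretation made here so that the identity holds; the helper names and the profile a_T(\tau) = 2 + \tau are illustrative only.

```python
def trapezoid(f, a, b, n=2000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


def target_term_direct(a_T, t, t_f):
    """Integral of (t_f - tau) * a_T(tau) over (t, t_f), i.e. K_3 / K_0."""
    return trapezoid(lambda tau: (t_f - tau) * a_T(tau), t, t_f)


def target_term_velocity_form(v_T, t, t_f):
    """(t_f - t) * [V_A(t) - V_T(t)], with V_A the mean of v_T on (t, t_f)."""
    v_avg = trapezoid(v_T, t, t_f) / (t_f - t)
    return (t_f - t) * (v_avg - v_T(t))


# Hypothetical target motion: a_T = 2 + tau, so v_T = 5 + 2 tau + tau^2 / 2
a_T = lambda tau: 2.0 + tau
v_T = lambda tau: 5.0 + 2.0 * tau + 0.5 * tau * tau
direct = target_term_direct(a_T, 0.0, 2.0)
velocity_form = target_term_velocity_form(v_T, 0.0, 2.0)
```

For this profile both expressions evaluate to approximately 16/3.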


2.5 IMPLEMENTATION

To develop an implementation of a real-time optimal guidance and
control processor that is usable in an adaptive mode by continuously
updating the coefficients, a second-order, variable-parameter vehicle
model was chosen. This model represents only a single dimension of the
problem, but is highly representative of an actual system. Furthermore,
the control computation for the other plane is inherently a totally
parallel process. A minimum terminal-miss, minimum control-action
optimal strategy was chosen.

To implement the algorithm in a realistic planar problem, the
following scalar dynamic equations are chosen:

    \dot{y} = v
    \dot{v} = a_T - a_I

with the airframe model responding in accordance with

    \ddot{a}_I + 2 \zeta \omega_n \dot{a}_I + \omega_n^2 a_I = \omega_n^2 a_c

where a_c is the commanded acceleration specified by optimal guidance.
The performance index for optimization is chosen as

    J = 0.5 s y^2(t_f) + 0.5 r \int_{t_0}^{t_f} a_c^2(t) dt

where s and r are scalar performance parameters weighing terminal miss
and control action costs.

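System parameter matrices corresponding to equation (8) are

    A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix},
    B = \begin{bmatrix} 0 \\ -1 \end{bmatrix},
    D = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix},

    E = \begin{bmatrix} 0 & 1 \\ -\omega_n^2 & -2 \zeta \omega_n \end{bmatrix},
    F = \begin{bmatrix} 0 \\ \omega_n^2 \end{bmatrix},
    G = \begin{bmatrix} 1 & 0 \end{bmatrix},
    H = 0,

    S_f = \begin{bmatrix} s & 0 \\ 0 & 0 \end{bmatrix},
    R = r
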


The optimal commanded acceleration is then given by:

    a_c = -(\omega_n^2 / r) [S_4 y + S_7 v + S_9 a_I + S_{10} \dot{a}_I + S_{14} a_T]

where the optimal gain coefficients satisfy

    \dot{S}_1 - \alpha S_4^2 = 0,                                                      S_1(t_f) = s
    \dot{S}_2 + S_1 - \alpha S_4 S_7 = 0,                                              S_2(t_f) = 0
    \dot{S}_3 - S_2 - \beta S_4 - \alpha S_4 S_9 = 0,                                  S_3(t_f) = 0
    \dot{S}_4 + S_3 - \gamma S_4 - \alpha S_4 S_{10} = 0,                              S_4(t_f) = 0
    \dot{S}_5 + 2 S_2 - \alpha S_7^2 = 0,                                              S_5(t_f) = 0
    \dot{S}_6 + S_3 - S_5 - \beta S_7 - \alpha S_7 S_9 = 0,                            S_6(t_f) = 0
    \dot{S}_7 + S_4 + S_6 - \gamma S_7 - \alpha S_7 S_{10} = 0,                        S_7(t_f) = 0
    \dot{S}_8 - 2 S_6 - 2 \beta S_9 - \alpha S_9^2 = 0,                                S_8(t_f) = 0
    \dot{S}_9 + S_8 - S_7 - \gamma S_9 - \beta S_{10} - \alpha S_9 S_{10} = 0,         S_9(t_f) = 0
    \dot{S}_{10} + 2 S_9 - 2 \gamma S_{10} - \alpha S_{10}^2 = 0,                      S_{10}(t_f) = 0
    \dot{S}_{11} - \alpha S_4 S_{14} + S_2 a_T = 0,                                    S_{11}(t_f) = 0
    \dot{S}_{12} + S_{11} - \alpha S_7 S_{14} + S_5 a_T = 0,                           S_{12}(t_f) = 0
    \dot{S}_{13} - S_{12} - \beta S_{14} - \alpha S_9 S_{14} + S_6 a_T = 0,            S_{13}(t_f) = 0
    \dot{S}_{14} + S_{13} - \gamma S_{14} - \alpha S_{10} S_{14} + S_7 a_T = 0,        S_{14}(t_f) = 0

and

    \alpha = \omega_n^4 / r
    \beta  = \omega_n^2
    \gamma = 2 \zeta \omega_n

These equations, processed from t_f to 0, are the objective guidance
equations for implementation of the real-time processor developed in the
subsequent chapters.
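
As a hedged illustration of what "processed from t_f to 0" involves, the sketch below integrates the first ten gain coefficients backward in time with a classical fourth-order Runge-Kutta step. It assumes that S_1 through S_10 are the distinct elements of a symmetric 4x4 Riccati matrix for the state (y, v, a_I, \dot{a}_I), with right-hand sides taken from that structure; the function names, the step count, and the parameter values are illustrative assumptions, and the sequential stepping shown here is not the parallel scheme developed in the report.

```python
def gain_derivatives(S, alpha, beta, gamma):
    """Time derivatives of S_1..S_10 (S[0] holds S_1) for the scalar problem."""
    S1, S2, S3, S4, S5, S6, S7, S8, S9, S10 = S
    return [
        alpha * S4 * S4,
        -S1 + alpha * S4 * S7,
        S2 + beta * S4 + alpha * S4 * S9,
        -S3 + gamma * S4 + alpha * S4 * S10,
        -2.0 * S2 + alpha * S7 * S7,
        -S3 + S5 + beta * S7 + alpha * S7 * S9,
        -S4 - S6 + gamma * S7 + alpha * S7 * S10,
        2.0 * S6 + 2.0 * beta * S9 + alpha * S9 * S9,
        -S8 + S7 + gamma * S9 + beta * S10 + alpha * S9 * S10,
        -2.0 * S9 + 2.0 * gamma * S10 + alpha * S10 * S10,
    ]


def integrate_gains_backward(s, r, omega_n, zeta, t_f, n_steps=1000):
    """Integrate S_1..S_10 from t_f back to 0, starting from S_1(t_f) = s."""
    alpha = omega_n ** 4 / r
    beta = omega_n ** 2
    gamma = 2.0 * zeta * omega_n
    S = [s] + [0.0] * 9          # terminal conditions at t_f
    h = -t_f / n_steps           # negative step: march backward in time

    def shifted(base, k, scale):
        return [b + scale * ki for b, ki in zip(base, k)]

    for _ in range(n_steps):
        k1 = gain_derivatives(S, alpha, beta, gamma)
        k2 = gain_derivatives(shifted(S, k1, h / 2), alpha, beta, gamma)
        k3 = gain_derivatives(shifted(S, k2, h / 2), alpha, beta, gamma)
        k4 = gain_derivatives(shifted(S, k3, h), alpha, beta, gamma)
        S = [
            si + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for si, a, b, c, d in zip(S, k1, k2, k3, k4)
        ]
    return S


# Hypothetical parameter values, chosen only to exercise the code
gains_at_t0 = integrate_gains_backward(s=1.0, r=1.0, omega_n=2.0, zeta=0.7, t_f=1.0)
feedback_gains = (gains_at_t0[3], gains_at_t0[6], gains_at_t0[8], gains_at_t0[9])
```

The tuple `feedback_gains` collects S_4, S_7, S_9, and S_10, the coefficients that multiply y, v, a_I, and \dot{a}_I in the commanded acceleration.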

Chapter 3

PARALLEL IMPLEMENTATION 


In this chapter the restructuring of the sequential algorithm is 
discussed so that a real-time, efficient, and parallel implementation 
can be developed. The performance of the modified algorithm and the 
experimental results are discussed in Chapter 4. The steps of the 
restructured algorithm are outlined in Section 3.1. Parallel numerical 
integration techniques are reviewed in Section 3.2. Section 3.3 presents 
the concepts of a task graph. Section 3.4 deals with the allocation 
problem in general, and two allocation schemes in specific. 


3.1 RESTRUCTURING

To calculate the commanded acceleration of the vehicle in one
direction, 14 ordinary differential equations must be solved in real
time. Speed and accuracy are the salient requirements. The computer
architecture suitable for this parallel algorithm is of no less
importance and this is the focus of Chapter 5. Here only the algorithmic
implications are considered.

The key to parallel implementation is identifying as many operations 
as possible to be executed in parallel and removing dependencies 
wherever feasible. The overall problem is segmented into several tasks 
which are simultaneously executable. These asynchronously cooperating 
tasks will be executed on different processors if available. 

The basic steps of the restructured algorithm are as follows:

1.) Remove the sequential bottlenecks of numerical integration
    of ordinary differential equations. In the past three
    decades several authors have come up with efficient and
    parallel numerical integration techniques.

2.) Segment the evaluation of derivatives and the integration of
    state variables into several distinct tasks.

3.) Construct a maximally parallel task system considering all
    the precedence constraints.

4.) Develop an efficient allocation algorithm to schedule the
    tasks to different processing elements in a multiprocessor
    environment.

3.2 PARALLEL NUMERICAL INTEGRATION TECHNIQUES 

There has been some effort over the years to speed up the numerical
integration of ordinary differential equations (ODEs). The older
sequential techniques have been modified so that they are suitable in an
environment with several processors. In this section some of
these parallel numerical integration techniques are reviewed.

The parallel methods compute the solution of a set of n ODEs
described by

    y'(t) = f(t, y(t)),        y(t_0) = y_0.

Most methods generate y_n, an approximation to y(t_n), on a mesh
a = t_0 < t_1 < t_2 < ... < t_N = b. These are called step-by-step
difference methods. An r-step difference method is one which computes
y_{n+1} using r earlier values y_n, y_{n-1}, ..., y_{n-r+1}. This
numerical integration by finite differences is a sequential calculation.
Lately, the question of using these formulas simultaneously on a set of
arithmetic processors to increase the speed has been addressed by many
authors.

3.2.1 Runge-Kutta (RK) Methods

In the general form of an r-step RK method, the integration step
leading from Y_n to Y_{n+1} consists of computing

    K_1 = h f(t_n, Y_n)
    K_i = h f(t_n + a_i h, Y_n + \sum_{j=1}^{i-1} b_{ij} K_j),        i = 2, ..., r
    Y_{n+1} = Y_n + \sum_{i=1}^{r} R_i K_i

with appropriate values of a's, b's, and R's. A classical 4-step RK
method is

    K_1 = h f(t_n, Y_n)
    K_2 = h f(t_n + h/2, Y_n + (1/2) K_1)
    K_3 = h f(t_n + h/2, Y_n + (1/2) K_2)
    K_4 = h f(t_n + h, Y_n + K_3)
    Y_{n+1} = Y_n + (1/6) (K_1 + 2 K_2 + 2 K_3 + K_4)
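
A minimal Python sketch of one step of the classical four-stage formula is given below for a scalar equation. The function name `rk4_step` and the test equation y' = y are illustrative choices, not part of the report.

```python
import math


def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) from t to t + h with the classical RK formula."""
    k1 = h * f(t, y)
    k2 = h * f(t + h / 2.0, y + k1 / 2.0)
    k3 = h * f(t + h / 2.0, y + k2 / 2.0)
    k4 = h * f(t + h, y + k3)
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


# Integrate y' = y, y(0) = 1 over [0, 1] in ten steps; the result
# approximates e.
y, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = rk4_step(lambda t_, y_: y_, t, y, h)
    t += h
error = abs(y - math.e)
```

Each stage K_i needs the previous stage, which is the sequential dependence that the parallel variants are designed to relax.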


Miranker and Liniger [11] developed parallel Runge-Kutta formulas. In
the parallel computation of a third-order approximation Y_n^3, first- and
second-order approximations, Y_n^1 and Y_n^2, respectively, must also be
computed. As a consequence, the third-order parallel method
gives a second-order parallel scheme as a by-product. The formulas of
the parallel schemes have the structures:

    first order (RK1):
        K_1^1 = h f(t_n, Y_n^1)
        Y_{n+1}^1 = Y_n^1 + K_1^1

    second order (RK2):
        K_1^2 = K_1^1 = h f(t_n, Y_n^1)
        K_2^2 = h f(t_n + a h, Y_n^1 + b K_1^2)
        Y_{n+1}^2 = Y_n^2 + R_1^2 K_1^2 + R_2^2 K_2^2

    third order (RK3):
        K_1^3 = K_1^1
        K_2^3 = K_2^2
        K_3^3 = h f(t_n + a h, Y_n^2 + b K_1^3 + c K_2^3)
        Y_{n+1}^3 = Y_n^3 + R_1^3 K_1^3 + R_2^3 K_2^3 + R_3^3 K_3^3

The parallel character of the above formulas ensures that RKi is
independent of RKj if and only if i < j, i, j = 1, 2, 3. This means that if
RK1 runs one step ahead of RK2, and RK2 runs one step ahead of RK3, then
they can be executed simultaneously. Using Kopal's [12] values of the
R's, the parallel third-order RK formula is given by:

    K_{n+2}^1 = h f(t_{n+2}, Y_{n+2}^1)
    Y_{n+3}^1 = Y_{n+2}^1 + K_{n+2}^1

    K_{n+1}^2 = h f(t_{n+1} + a h, Y_{n+1}^1 + a K_{n+1}^1)
    Y_{n+2}^2 = Y_{n+1}^2 + (1 - 1/(2a)) K_{n+1}^1 + (1/(2a)) K_{n+1}^2

    K_n^3 = h f(t_n + a_1 h, Y_n^2 + (a_1 - 1/(6a)) K_n^1 + (1/(6a)) K_n^2)
    Y_{n+1}^3 = Y_n^3 + ((2 a_1 - 1)/(2a)) (K_n^1 - K_n^2) + K_n^3

where a = 2(1 - 3 a_1^2) / (3(1 - 2 a_1)).

The above third-order RK formulas require 3 processors to compute the
three functions in parallel. The main drawback of this scheme is that
it is weakly stable and leads to an error that grows linearly with n.

3.2.2 Interpolation Method

Nievergelt [13] proposed a parallel form of a serial integration
method in which the algorithm is divided into several subtasks which are
computed independently. The idea is to divide the integration interval
[a,b] into N equal subintervals [t_{i-1}, t_i], t_0 = a, t_N = b,
i = 1, 2, 3, ..., N, to make a rough prediction y_i^0 of the solution
y(t_i), to select a certain number of values y_{ij}, j = 1, ..., M_i, in
the vicinity of y_i^0, and finally to integrate the systems

    y' = f(t,y),    y(t_0) = y_0,       t \in [t_0, t_1]
    y' = f(t,y),    y(t_i) = y_{ij},    t \in [t_i, t_{i+1}],    j = 1, ..., M_i,    i = 1, ..., N-1

with an accurate integration method M.

The integration interval [a,b] is covered with solution segments of
equal length, (b-a)/N. The connection between these branches is made by
interpolating the previous solution segment over the next interval to
the right. The time of this computation can be represented by

    T_PI = (time for serial integration)/N + time to predict y_i^0
           + interpolation time + bookkeeping time.

Interpolation can be done in parallel. If it is assumed that the
time-consuming part is the evaluation of f(t,y) and the other
contributions to the total computation time are negligible, the
computation time is reduced to 1/N of the serial time, i.e., the speedup
approaches N.
But to compare this method with serial integration from a to b using
method M, the error introduced by the interpolation is significant. This
error depends on the problem, not on the method. For linear problems the
error is bounded, but for nonlinear problems it may not be. Thus, the
usefulness of this method is restricted to a specific class of problems,
and depends on the choice of parameters like y_i^0 and method M.

3.2.3 Predictor Corrector (PC) Methods

One-step methods do not make full use of the available information.
It seems plausible that more accuracy can be obtained if the value of
y_{n+1} is made to depend not only on y_n but also on y_{n-1} and
f_{n-1}, f_{n-2}, .... For this reason multi-step methods have become
very popular. For high accuracy they usually require less computation
than one-step methods.

A standard fourth-order serial predictor corrector given by
Adams-Moulton is:

    Y_{i+1}^p = Y_i^c + (h/24) (55 f_i^c - 59 f_{i-1}^c + 37 f_{i-2}^c - 9 f_{i-3}^c)
    Y_{i+1}^c = Y_i^c + (h/24) (9 f_{i+1}^p + 19 f_i^c - 5 f_{i-1}^c + f_{i-2}^c)

The computation scheme, PECE, of the PC step to calculate y_{i+1} is
as follows:

1. Use the predictor equation to calculate an initial
   approximation to y_{i+1}.

2. Evaluate the derivative function f_{i+1}^p.

3. Use the corrector equation to calculate a better
   approximation to y_{i+1}.

4. Evaluate the derivative function f_{i+1}^c.

5. Check the termination rule. If it is not time to stop,
   increment i, set y_{i+1} = y_{i+1}^c and return to step 1.
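
The PECE cycle can be sketched in Python as follows, using the standard fourth-order Adams-Bashforth predictor and Adams-Moulton corrector coefficients for a scalar equation. The function name `abm4_pece`, the use of exact starting values, and the test equation y' = -y are assumptions made for illustration.

```python
import math


def abm4_pece(f, t0, y_start, h, n_steps):
    """Serial PECE with a fourth-order Adams predictor-corrector pair.

    y_start holds four accurate starting values y_0..y_3 on the mesh
    t0, t0 + h, t0 + 2h, t0 + 3h. Returns the list of corrected values.
    """
    ys = list(y_start)
    fs = [f(t0 + k * h, ys[k]) for k in range(4)]
    for i in range(3, 3 + n_steps):
        t_next = t0 + (i + 1) * h
        # P: predictor (Adams-Bashforth)
        y_pred = ys[i] + h / 24.0 * (
            55.0 * fs[i] - 59.0 * fs[i - 1] + 37.0 * fs[i - 2] - 9.0 * fs[i - 3]
        )
        # E: derivative at the predicted value
        f_pred = f(t_next, y_pred)
        # C: corrector (Adams-Moulton)
        y_corr = ys[i] + h / 24.0 * (
            9.0 * f_pred + 19.0 * fs[i] - 5.0 * fs[i - 1] + fs[i - 2]
        )
        # E: derivative at the corrected value, kept for later steps
        ys.append(y_corr)
        fs.append(f(t_next, y_corr))
    return ys


# y' = -y, y(0) = 1, integrated to t = 1 with h = 0.1
h = 0.1
start = [math.exp(-k * h) for k in range(4)]
ys = abm4_pece(lambda t, y: -y, 0.0, start, h, 7)
error = abs(ys[10] - math.exp(-1.0))
```

Within one step the corrector cannot start until the predictor and its derivative evaluation have finished, so the four PECE operations run strictly in sequence.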

Let

    T_f   = total time taken by the function evaluations done for one
            step
    T_PCE = time taken to compute the predictor (corrector) value for a
            single equation

then the time taken by one step of the serial predictor corrector is

    T = 2(n T_PCE + T_f)

The serial method is schematized in Fig. 3.1.

Fig. 3.1 Serial Predictor Corrector Method

The upper line represents the progress of the computation at the
mesh points for y_n^p and f_n^p, and the lower for y_n^c and f_n^c. The
broken vertical line is the computation front. The calculations ahead of
the front depend on information on both sides. This is a characteristic
of sequential calculation.

Miranker and Liniger developed formulas for the PC method in which
the corrector does not depend serially upon the predictor, so that the
predictor and corrector calculations can be performed simultaneously.
The parallel predictor corrector (PPC) also operates in a PECE mode, and
the calculation advances s steps at a time. There are 2s processors and
each processor performs either a predictor or a corrector calculation.
A fourth-order PPC is given by:

    y_{i+1}^p = y_{i-1}^c + (h/3) (8 f_i^p - 5 f_{i-1}^c + 4 f_{i-2}^c - f_{i-3}^c)
    y_i^c     = y_{i-1}^c + (h/24) (9 f_i^p + 19 f_{i-1}^c - 5 f_{i-2}^c + f_{i-3}^c)

The method is schematized in Fig. 3.2.

Fig. 3.2 Parallel Predictor Corrector Method

The computations at points ahead of the front depend only on
information behind the front, a characteristic of parallel
computation. The sequence of computation is divided and each of its two
parts

    y_{n+1}^p, f_{n+1}^p    and    y_n^c, f_n^c

may be simultaneously executed on separate processors.
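
The data dependence of the fourth-order parallel predictor corrector can be illustrated with a sequential simulation in Python. The sketch assumes the predictor y_{i+1}^p = y_{i-1}^c + (h/3)(8 f_i^p - 5 f_{i-1}^c + 4 f_{i-2}^c - f_{i-3}^c) and the corrector y_i^c = y_{i-1}^c + (h/24)(9 f_i^p + 19 f_{i-1}^c - 5 f_{i-2}^c + f_{i-3}^c); the function name `ppc4`, the seeding of the first predicted derivative from an accurate starting value, and the test equation are illustrative assumptions. The two updates inside the loop are written one after the other, but neither uses the other's result.

```python
import math


def ppc4(f, t0, y_start, h, n_steps):
    """Sequential simulation of a fourth-order parallel predictor-corrector.

    y_start holds accurate values y_0..y_3. Returns the corrected values.
    """
    yc = list(y_start)
    fc = [f(t0 + k * h, yc[k]) for k in range(4)]
    # Seed: predicted value at mesh point 4, using f_3 in place of f^p_3
    yp_next = yc[2] + h / 3.0 * (8.0 * fc[3] - 5.0 * fc[2] + 4.0 * fc[1] - fc[0])
    for i in range(4, 4 + n_steps):
        fp_i = f(t0 + i * h, yp_next)   # f^p_i from the predicted y^p_i
        # "Processor 1": predictor for point i + 1
        yp_new = yc[i - 1] + h / 3.0 * (
            8.0 * fp_i - 5.0 * fc[i - 1] + 4.0 * fc[i - 2] - fc[i - 3]
        )
        # "Processor 2": corrector for point i, independent of yp_new
        yc_i = yc[i - 1] + h / 24.0 * (
            9.0 * fp_i + 19.0 * fc[i - 1] - 5.0 * fc[i - 2] + fc[i - 3]
        )
        yc.append(yc_i)
        fc.append(f(t0 + i * h, yc_i))
        yp_next = yp_new
    return yc


# y' = -y, y(0) = 1, integrated to t = 1 with h = 0.05
h = 0.05
start = [math.exp(-k * h) for k in range(4)]
yc = ppc4(lambda t, y: -y, 0.0, start, h, 17)
error = abs(yc[20] - math.exp(-1.0))
```

Because both updates read only f_i^p and corrected values up to point i-1, they could be assigned to two processors that exchange results once per step.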

As shown in [14], the parallel time for a single step of the
fourth-order PPC method is given by:

    T = n T_PCE + T_f + 3n T_DC + 2 T_S

where

    T_PCE and T_f are as defined before,
    T_DC = time taken for data communication, and
    T_S  = time taken for synchronization.
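
The two step-time expressions, T = 2(n T_PCE + T_f) for the serial method and T = n T_PCE + T_f + 3n T_DC + 2 T_S for the parallel method, can be compared with a few lines of Python. The function names and the timing values below are hypothetical and serve only to show how communication and synchronization overheads reduce the ideal factor-of-two gain.

```python
def serial_pc_step_time(n, t_pce, t_f):
    """Serial predictor-corrector step time: T = 2 (n T_PCE + T_f)."""
    return 2.0 * (n * t_pce + t_f)


def parallel_pc_step_time(n, t_pce, t_f, t_dc, t_s):
    """Parallel step time: T = n T_PCE + T_f + 3 n T_DC + 2 T_S."""
    return n * t_pce + t_f + 3.0 * n * t_dc + 2.0 * t_s


# Hypothetical timings in arbitrary units for n = 14 equations
n = 14
serial = serial_pc_step_time(n, t_pce=1.0, t_f=20.0)
parallel = parallel_pc_step_time(n, t_pce=1.0, t_f=20.0, t_dc=0.1, t_s=0.5)
speedup = serial / parallel
```

With zero communication and synchronization cost the ratio is exactly 2; any overhead lowers it.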

For 4 processors (s = 2) the parallel PPC formulas are:

    Y_{2n+2}^p = Y_{2n-2}^c + 4h f_{2n}^p
    Y_{2n+1}^p = Y_{2n-2}^c + (3h/2) (f_{2n}^p + f_{2n-1}^p)
    Y_{2n}^c   = Y_{2n-3}^c - (h/2) (3 f_{2n}^p - 9 f_{2n-1}^p)
    Y_{2n-1}^c = Y_{2n-3}^c + 2h f_{2n-2}^c

The generally higher accuracy and fewer function evaluations of PC
methods, as compared to RK methods, are obtained at the cost of increased
complexity and possible numerical instability. The parallel RK methods
do not inherit the stability of their serial counterparts. On the other
hand, PPC methods are as stable as their serial formulas. This is
proven by Katz, et al. [15].

3.3 GENERATION OF THE TASK GRAPH 

A task is defined as a unit of computational activity specified in 
terms of the input variables that it requires, the output variables that 
it generates, and its execution time. The specific transformation that 
it imposes on the inputs to produce the output is not a part of the 
specification of the task. Thus, the task may be considered
uninterpreted.

Let J = (T_1, T_2, ..., T_n) be a set of tasks and <. an
irreflexive partial order or precedence relation defined on J. Then
C = (J, <.) is called a task system. The precedence relation means that,
if T <. T', then T must complete execution before T' is started.

From this definition a graphical representation, called a task
graph, is obtained for a task system. This consists of a directed graph
whose vertices, or nodes, are the tasks J and which has an edge from T
to T' if T <. T' and there is no T'' such that T <. T'' <. T'. Thus, the
set of edges in the task graph represents the smallest relation whose
transitive closure is <. .
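
The construction of the task graph edges from a precedence relation can be sketched as a transitive reduction. In the Python fragment below the relation is given as a set of (T, T') pairs; the function names and the four-task example are illustrative assumptions.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing the given (before, after) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def task_graph_edges(tasks, precedence):
    """Keep the pair (T, T') only if no task T'' lies strictly between them."""
    closure = transitive_closure(precedence)
    edges = set()
    for (a, b) in closure:
        has_intermediate = any(
            (a, c) in closure and (c, b) in closure for c in tasks
        )
        if not has_intermediate:
            edges.add((a, b))
    return edges


# Hypothetical task system: T1 precedes T2, T3, and T4; T2 precedes T3
tasks = ["T1", "T2", "T3", "T4"]
precedence = {("T1", "T2"), ("T2", "T3"), ("T1", "T3"), ("T1", "T4")}
edges = task_graph_edges(tasks, precedence)
```

The pair (T1, T3) is dropped because T2 lies between T1 and T3, so the graph keeps only the edges whose transitive closure reproduces the original relation.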

With each task T two events are associated, initiation and
termination. An execution sequence of an n-task system C = (J, <.) is
any string S = \alpha_1, \alpha_2, ..., \alpha_{2n} of task events
satisfying the precedence relation (i.e., if T <. T', the termination
event of T must occur prior to the initiation event of T') and
consisting of exactly one initiation event and one termination event for
each task. A task system that represents a sequential program has only
one execution sequence; however, for other task systems there may be
several.

To discuss determinant task systems, let the physical system on
which task systems execute be represented by an ordered set of memory
cells M = (M_1, M_2, ..., M_m). With each task T in a system C two,
possibly overlapping, ordered subsets of M are associated, the domain
D_T and the range R_T. When T is initiated, it reads the values stored
in its domain cells, and when it terminates, it writes values in its
range cells.

Given an execution sequence S for a task system, the value sequence
V(M_i, S) is defined as the sequence of values written by terminating
tasks T in S for which M_i \in R_T.

The intuitive concept of determinant task systems is more rigorously
defined as follows:

A task system C is determinant if, for any given initial state,
V(M_i, S) = V(M_i, S'), 1 \le i \le m, for all execution sequences S
and S'.

From this definition, it is clear that a task system representing a 
sequential program is determinant since there is only one execution 
sequence. Two task systems both consisting of the same tasks are said 
to be equivalent if they are determinant and produce the same value 
sequences for the same initial state. The goal is to convert the
determinant task system represented by the sequential algorithm into an
equivalent determinant task system which has more parallelism.

Given a task system $C$, tasks $T$ and $T'$ are noninterfering if
$T \lessdot T'$ or $T' \lessdot T$. Task systems consisting of mutually
noninterfering tasks are determinant [16]. With the above background in
mind the task graph is generated. The exact details of this are given in
Chapter 4.


3.4 ALLOCATION 

Given a determinant task system in the form of a task graph and the 
execution time of each task, the next step is the assignment of the 
tasks to p processors. This is termed the allocation phase which is a 
part of the preprocessing stage. 

The following parameters are available for allocation:

1) a set of tasks $J = (T_1, \ldots, T_n)$,

2) an irreflexive partial order $\lessdot$ on $J$,

3) a weighting function $W$ from $J$ to the positive integers,
   representing the execution time of each of the tasks, and

4) the number of processors $p$.

As many as $p$ tasks can be executed in parallel at any time. If task
$T$ is first executed at time $t$ using processor $K$, then it is executed
only at times $t, t+1, \ldots, t+W(T)-1$, using processor $K$ each time.
This is an example of non-preemptive allocation, where once a task is
assigned to a processor it must be completed before any other task is
assigned to the same processor. An additional requirement is that any
task $T'$ such that $T' \lessdot T$ complete execution at some time $t'$,
where $t' < t$.

A schedule is an assignment of tasks to processors that satisfies
the above conditions and has length $t_{max}$, where $t_{max}$ is the
latest time at which any task terminates. The allocation problem is the
determination of an assignment that minimizes $t_{max}$ with a minimum
number of parallel processors. This type of problem has been studied
extensively by a number of researchers [17]. It has been shown to be
NP-complete [18] and can be considered intractable.
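The conditions on a schedule can be made concrete with a small checker. The sketch below is illustrative only: the dictionary layout `task -> (start_time, processor)`, the function name, and the convention that the schedule starts at time 0 (so that the length equals the latest finishing time) are all assumptions.

```python
def schedule_length(weights, precedes, p, schedule):
    """Validate a non-preemptive schedule and return its length.

    weights:  task -> positive integer execution time
    precedes: set of (a, b) pairs, a must finish before b starts
    schedule: task -> (start_time, processor index in 0..p-1)
    """
    if set(schedule) != set(weights):
        raise ValueError("every task must be scheduled exactly once")
    finish = {t: s + weights[t] for t, (s, _) in schedule.items()}
    by_proc = {}
    for t, (s, k) in schedule.items():
        if not 0 <= k < p:
            raise ValueError(f"task {t} uses a non-existent processor")
        by_proc.setdefault(k, []).append((s, finish[t]))
    # Non-preemptive: intervals on one processor must not overlap.
    for intervals in by_proc.values():
        intervals.sort()
        for (_, f1), (s2, _) in zip(intervals, intervals[1:]):
            if s2 < f1:
                raise ValueError("two tasks overlap on one processor")
    # Precedence: a predecessor must be finished when its successor starts.
    for a, b in precedes:
        if finish[a] > schedule[b][0]:
            raise ValueError(f"{b} starts before {a} has finished")
    return max(finish.values())


weights = {"A": 2, "B": 3, "C": 1}
precedes = {("A", "C")}
schedule = {"A": (0, 0), "B": (0, 1), "C": (2, 0)}
t_max = schedule_length(weights, precedes, 2, schedule)
print(t_max)
```

Checking a schedule is cheap; it is the search for the schedule with the smallest length that is hard.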

When the number of processors, the task processing times and the 
precedence constraints are all arbitrary, the allocation problem is
NP-hard in the strong sense. Hence, unless P
equals NP, it is impossible to construct either a pseudopolynomial time 
allocation algorithm or a fully polynomial time approximation scheme 
[19]. 

To circumvent these difficulties, heuristic algorithms have been
considered the most powerful tools. In particular, the critical path
(CP) method [22] and HLFET (highest levels first with estimated times)
[20], which is essentially a list scheduling method, have been proposed.

Two different allocation schemes are discussed in this chapter. The
first, proposed by Kasahara and Narita [21] and known as the CP/MISF
(critical path/most immediate successors first) method, is an improved
version of the CP method. The second, a newly proposed algorithm based
on the application of the branch and bound technique, is termed BBAS
(branch and bound allocation scheme).

3.4.1 The CP/MISF Method 

A critical path is defined as the path from the exit node to the
entry node having the longest path length. In mathematical terms,

$$t_{cr} = \max_k \sum_{i \in \phi_k} t_i$$

where $\phi_k$ represents the kth path from the exit node to the entry
node. $t_{cr}$ is equal to the minimum possible execution time in which
any number of parallel processors can process the tasks of a given task
graph.

The level $l_i$, defined for each task, serves as the basis for
constructing the priority list. The level $l_i$ of task $i$ is defined
to be the longest path length from the exit node to task $i$, or

$$l_i = \max_k \sum_{j \in \pi_k} t_j$$

where $\pi_k$ stands for the kth path from the exit node to node $N_i$.
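A minimal Python sketch of the level computation follows. It assumes the task graph is acyclic and is supplied as a successor list; the names `levels`, `times` and `successors`, and the four-task example, are hypothetical.

```python
from functools import lru_cache


def levels(times, successors):
    """Level of each task: longest path length from the task to the exit.

    times:      task -> execution time
    successors: task -> list of immediate successors (missing means none)
    """
    @lru_cache(maxsize=None)
    def level(i):
        # A task's level is its own time plus the largest successor level.
        return times[i] + max((level(s) for s in successors.get(i, ())),
                              default=0)

    return {i: level(i) for i in times}


times = {1: 2, 2: 3, 3: 1, 4: 4}
successors = {1: [2, 3], 2: [4], 3: [4]}
lv = levels(times, successors)
t_cr = max(lv.values())   # critical path length = level of the entry node
print(lv, t_cr)
```

In this example the critical path is 1, 2, 4, so no number of processors can finish the graph in fewer than 9 time units.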

The CP method is essentially a generalization of Hu's algorithm
[22]. Since the priority order cannot be uniquely determined when more
than one task has the same level, the worst schedule may result,
depending upon the task chosen. In the CP/MISF method, when two tasks
have the same level, the task having the largest number of immediately
successive tasks is assigned the higher priority.

The CP/MISF method consists of the following steps: 

Step 1: Determine the level $l_i$ for each task.

Step 2: Construct a priority list in the descending order of $l_i$
        and the number of immediately successive tasks.

Step 3: Renumber the tasks from 1 to n in the descending order
        of priority.

Step 4: Execute list scheduling on the basis of the priority
        list.
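The four steps can be sketched as follows. This is an illustrative Python version written for this text, not the PASCAL program of Appendix B; the function name, the data layout, and the event-driven form of the list scheduler (advance time to the next task completion) are assumptions. Task times are assumed to be positive and the graph acyclic.

```python
def cp_misf_schedule(times, successors, p):
    """Return (schedule, length); schedule maps task -> (start, processor)."""
    # Step 1: levels (longest path from each task to the exit).
    level = {}

    def compute(i):
        if i not in level:
            level[i] = times[i] + max(
                (compute(s) for s in successors.get(i, ())), default=0)
        return level[i]

    for i in times:
        compute(i)
    # Steps 2 and 3: priority list, by level and then by successor count.
    order = sorted(times,
                   key=lambda i: (-level[i], -len(successors.get(i, ()))))
    preds = {i: set() for i in times}
    for i, succ in successors.items():
        for s in succ:
            preds[s].add(i)
    # Step 4: list scheduling.
    finish, schedule = {}, {}
    free_at = [0] * p
    now = 0
    while len(schedule) < len(times):
        for k in range(p):
            if free_at[k] > now:
                continue
            # Highest-priority unscheduled task whose predecessors are done.
            ready = next(
                (i for i in order if i not in schedule and
                 all(finish.get(q, float("inf")) <= now for q in preds[i])),
                None)
            if ready is None:
                break
            schedule[ready] = (now, k)
            finish[ready] = free_at[k] = now + times[ready]
        pending = [f for f in finish.values() if f > now]
        if pending:
            now = min(pending)   # jump to the next task completion
    return schedule, max(finish.values())


times = {1: 2, 2: 3, 3: 1, 4: 4}
successors = {1: [2, 3], 2: [4], 3: [4]}
schedule, length = cp_misf_schedule(times, successors, p=2)
print(schedule, length)
```

For this small graph two processors already reach the critical path length of 9, so adding more processors cannot shorten the schedule.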

The problem of determining the level of each task involves the
calculation of the longest path from the exit node to each node. If the
task graph has more than one exit node, a dummy exit node is added.
Since all arcs are directed from the entry node toward the exit node,
the longest path is measured in the direction opposite to the
orientation of each arc. This problem can be solved in $O(n^2)$ time by
solving Bellman's equations [23].

Let

$$a_{ij} = \begin{cases} 0 & \text{if link } (j,i) \text{ exists} \\ -\infty & \text{otherwise} \end{cases}$$

$t_i$ = time for executing task $i$
$l_i$ = level of task $i$

Then for n tasks,

$$l_n = t_n$$

$$l_j = \max_{k \ne j} \{ l_k + a_{kj} + t_j \}, \quad j = n-1, n-2, \ldots, 2, 1.$$


For each node $j$, $j$ not equal to $n$, there must be some arc $(k,j)$
in a longest path from $n$ to $j$. Whatever the identity of $k$, it is
certain that $l_j = l_k + a_{kj} + t_j$. This follows from the fact that
the part of the path which extends to node $k$ must be the longest path
from $n$ to $k$; if this were not so, the overall path to $j$ would not
be as long as possible. (This is the "Principle of Optimality.") But
there are only finitely many choices of $k$, i.e.,
$k = n, n-1, \ldots, j+1, j-1, \ldots, 2, 1$. Clearly $k$ must be a node
for which $l_j$ is as large as possible. In fact, the effect of the most
immediate successive tasks can be incorporated in the same step by
modifying $l_j$:

$$l_j \leftarrow l_j + \mathit{imsucc}_j / n$$

where $\mathit{imsucc}_j$ is the number of immediate successive tasks
for task $j$ and $n$ is the total number of tasks.
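The reason this modification acts purely as a tie-breaker is that a task has at most n-1 immediate successors, so the added fraction is always less than 1 and cannot reorder tasks whose integer levels differ. A short Python sketch, with hypothetical names and data, and with the correction applied once after all levels are known (an assumption about how the step is carried out):

```python
def misf_priority(level, successors):
    """Single sort key: level plus (number of immediate successors) / n."""
    n = len(level)
    return {j: level[j] + len(successors.get(j, ())) / n for j in level}


# Tasks A and B share level 5; B has more immediate successors.
level = {"A": 5, "B": 5, "C": 4, "D": 4}
successors = {"A": ["C"], "B": ["C", "D"]}
priority = misf_priority(level, successors)
order = sorted(level, key=lambda j: -priority[j])
print(priority, order)
```

Here B is placed ahead of A, while C and D, whose integer level is lower, stay behind both.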

Actual experimental results with respect to the specific problem are 
discussed later. It has been shown [21] that optimal solutions were 
obtained for about 67 percent of some 200 cases tested by the CP/MISF 
method. Approximate solutions with error less than 5 percent were
obtained for 87 percent of the cases and those less than 10 percent for
98.5 percent of the cases.


3.4.2 The BBAS


Branch and bound implicit enumeration algorithms have emerged as the 
principal method for finding optimum solutions to discrete optimization 
problems. Kohler's [24] general representation can be used for the 
classification of the branch and bound technique. In the expression 
BB(Bp,S,E,L,U,RB) each parameter has the following significance: 

Bp : branching rule
S  : selection rule of next branching node
E  : elimination rule
L  : lower bound function
U  : upper bound cost
RB : resource bounds

The proposed algorithm works by partitioning the set of schedules
into smaller and smaller subsets, finding lower bounds on the total
execution times of each of the subsets, and using these bounds to guide
further partitioning until a single schedule is obtained whose total
execution time is less than or equal to the lower bounds of all the
other subsets.

Branching Rule, Bp :

An allocation instant is defined as the time when one or more processors
have just finished execution of their allocated task(s) and succeeding
task(s) become executable. Since the task times are different, there is
the possibility that the optimal schedule may not be obtained by simple
list scheduling methods. At each stage of the branching procedure, nodes
should be generated to include the cases where a processor or processors
become idle. To this end, fictitious tasks which correspond to idle
processors are introduced, as done in [21]. These idle tasks, together
with the ready tasks, are allocated. The number of idle tasks is

$$N_{idle} = \begin{cases} M_{av} - 1 & \text{for } M_{av} = M \\ M_{av} & \text{for } 1 \le M_{av} < M \end{cases}$$

where $N_{idle}$ is the number of idle tasks, $N_{ready}$ is the number
of ready tasks, $M_{av}$ is the number of available processors, and $M$
is the total number of processors.

Then the number of nodes generated from each branching node is given
by

$$N_{branch} = {}_{N_{alloc}}C_{M_{av}}$$

where C is the number of combinations and

$$N_{alloc} = N_{ready} + N_{idle}.$$

The set of allocatable tasks is represented by A.
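The branching count can be evaluated directly. The sketch below is only illustrative: the function name is hypothetical, and the piecewise rule for the number of idle tasks is coded as read from the text above (one fewer than the available processors when all processors are available, otherwise equal to the number available).

```python
from math import comb


def branch_count(n_ready, m_av, m_total):
    """Number of child nodes generated at one allocation instant."""
    if not 1 <= m_av <= m_total:
        raise ValueError("need 1 <= m_av <= m_total")
    # Idle (fictitious) tasks, following the piecewise rule in the text.
    n_idle = m_av - 1 if m_av == m_total else m_av
    n_alloc = n_ready + n_idle
    # Choose m_av of the allocatable tasks for the available processors.
    return comb(n_alloc, m_av)


print(branch_count(n_ready=3, m_av=2, m_total=2))
print(branch_count(n_ready=3, m_av=1, m_total=2))
```

Even for a handful of ready tasks the count grows quickly with the number of available processors, which is why the bounds and the elimination rule matter.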

Selection Rule, S : 

The selection rule is used to choose the next branching node from the
set of currently active nodes. The rule used in the algorithm is least
lower bound, or LLB. The next branching node is chosen by calculating
and comparing the lower bounds for all the active nodes at each
branching instant.

Lower bound, L : 

The lower bound, in our case, is simply the total execution time for the 
partial schedule represented by each node. 

Upper bound cost, U : 

When the solution of the original problem is known a priori, its value
can be used as U. Otherwise, U is set equal to $\infty$. The value U is
updated whenever a smaller solution U' is obtained. The smaller the
value of U at an early stage of the search process, the shorter is the
search time, and a reasonable value of U is evaluated with the help of a
heuristic algorithm.

Elimination Rule, E : 

To eliminate some of the active nodes the following rule is employed.
Whenever the lower bound $LB(n_i)$ of a node $n_i$ is not smaller than
the current upper bound U, the node $n_i$ is eliminated.

Resource Bound, RB :

This is the allowable computing time limit and storage capacity limit.
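How the selection, bounding and elimination rules fit together can be shown on a toy problem. The sketch below is not the BBAS itself: it assigns independent tasks (no precedence constraints, no idle tasks) to two processors, which is an assumption chosen only to keep the example short. All names are hypothetical, the resource bound RB is omitted, and the elimination test uses "lower bound not smaller than U".

```python
import heapq
from itertools import count


def branch_and_bound(root, branch, lower_bound, is_complete, cost,
                     upper=float("inf")):
    """Least-lower-bound search with elimination against the bound U."""
    best, best_cost = None, upper
    tie = count()                       # avoids comparing nodes in the heap
    active = [(lower_bound(root), next(tie), root)]
    while active:
        lb, _, node = heapq.heappop(active)      # selection rule: LLB
        if lb >= best_cost:                      # elimination rule
            continue
        if is_complete(node):
            if cost(node) < best_cost:
                best, best_cost = node, cost(node)   # update U
            continue
        for child in branch(node):               # branching rule
            child_lb = lower_bound(child)
            if child_lb < best_cost:
                heapq.heappush(active, (child_lb, next(tie), child))
    return best, best_cost


# Toy instance: a node is (number of tasks placed, sorted processor loads).
TIMES = [3, 3, 2, 2, 2]
P = 2


def branch(node):
    k, loads = node
    children = set()
    for j in range(P):
        new = list(loads)
        new[j] += TIMES[k]
        children.add((k + 1, tuple(sorted(new))))
    return children


def makespan(node):
    # Execution time of the (partial) schedule, used as the lower bound.
    return max(node[1])


best, best_cost = branch_and_bound(
    (0, (0,) * P), branch, makespan,
    is_complete=lambda node: node[0] == len(TIMES), cost=makespan)
print(best, best_cost)
```

Seeding `upper` with the length of a heuristic schedule, as the text suggests, would let the elimination rule discard nodes from the very first step.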

At first glance the simplistic BBAS seems to have enormous time and
space complexities, but the greatest advantage of this scheme is its
inherent parallelism. The potential parallel paths in the control flow
of this algorithm may be made explicit and computed by multiple
processes. In other words, the loop is unfolded to let the multiple
processes work on different iterations of the unfolded loop. This
algorithm is in its rudimentary stage and will be further investigated
later. Presently the authors are working on a possible parallel
implementation of this new allocation scheme.


Chapter 4

VERIFICATION & EXPERIMENTAL RESULTS 


The restructured parallel algorithm was verified to check its
validity with respect to the actual problem at hand. The experiments
were mainly performed to show the improved performance of the new
parallel approach compared to the conventional sequential one. Results
are shown corresponding to one iteration of the integration of all
14 differential equations listed in Chapter 2. Section 4.1 deals with
the construction of the task graph. The implementation of the allocation 
algorithm is discussed in Section 4.2. The experimental results are 
analyzed in Section 4.3. 

4.1 CONSTRUCTION OF TASK GRAPH 

The fourth order, 2-processor parallel predictor corrector method,
outlined in Chapter 3, is chosen for solving the differential equations.
The basis of constructing the task graph lies in the definition of tasks
and their appropriate precedence constraints. Basically there are two
types of operations, updating the dependent variables and calculating
the functions. Each update of a dependent variable is defined to be a
task. Hence, for the fourteen differential equations involving
$P_1, P_2, \ldots, P_{14}$ there are twenty-eight different tasks,
namely $(P_i)^p$ and $(P_i)^c$ for $i = 1, \ldots, 14$. Note that for a
particular iteration level $j$, the corresponding tasks are
$(P_i)_{j+1}^p$ and $(P_i)_j^c$ for $i = 1, \ldots, 14$.

Due to the highly coupled nature of the differential equations, 
there are dependencies between the various functions. It is noted that 
decoupling methods can improve the situation. The function values are 
evaluated from the updated dependent variables with each function
evaluation task fragmented into smaller subtasks. These smaller tasks
are defined in such a manner that each of them has a roughly uniform
execution time. In this way the task graph becomes more or less balanced
and parallelism is optimally exploited.

Some sub-expressions, which are used a number of times, are
identified. Each of these sub-expressions is defined as a separate
task to prevent repetitive calculations. There are some other tasks
required to calculate some constants.

All the tasks are listed in APPENDIX A. Each task is associated with a
task number, task time, predecessor tasks and successor tasks. Initially
the tasks are numbered randomly. Later the tasks are renumbered, by the
allocation algorithm, according to their respective priorities. Task
times are calculated with the assumption that multiplication and
addition take 30 and 20 time units respectively. The time units are not
explicitly stated because they are dependent on the hardware used, and
hardware is the subject of later research. The predecessor and
successor tasks for any task are defined in terms of the inputs consumed
by that task and the tasks receiving its output.
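As an illustration of how a task time follows from these unit costs, the sketch below adds up the multiplications and additions in a task. The function name is hypothetical, and it is assumed that the operations inside one task are executed one after another on a single processor.

```python
MUL_TIME = 30   # time units per multiplication (from the text)
ADD_TIME = 20   # time units per addition (from the text)


def task_time(n_mul, n_add):
    """Execution time of a task with the given operation counts."""
    return n_mul * MUL_TIME + n_add * ADD_TIME


# Example: a sub-expression of the form a*b + c*d
# needs two multiplications and one addition.
print(task_time(n_mul=2, n_add=1))
```

Fragmenting a long function evaluation into subtasks of similar cost, as described above, amounts to choosing the fragments so that these sums come out roughly equal.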


4.2 ALLOCATION PROCESS 

This is perhaps the most important phase in the parallel method.
The CP/MISF algorithm, outlined in Chapter 3, was fully implemented in
PASCAL. The program is listed in APPENDIX B. The allocation process
translates the task graph into an execution schedule. The execution
schedule is the sequence in which the tasks are executed by the various
processors.

The allocation program takes the task graph as its input. The input
is provided in the form of task number, task time and links of the task
graph. The program is modularized into various procedures, taking
advantage of the structured nature of PASCAL. The 'initialize'
procedure reads the input data from an external file and generates an
adjacency matrix A such that

   A[i,j] = 1 if there is a link from task i to task j
          = 0 otherwise.

The 'level' procedure calculates the level of each task in the manner
described in Section 3.4.1. A matrix B is constructed such that

   B[i,j] = 0 if there is a link from task j to task i
          = -∞ otherwise.

The 'renumber' procedure generates the new numbers of all the tasks in
the descending order of the priorities. The adjacency matrix is
correspondingly modified.
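The two matrices and the level recurrence of Section 3.4.1 can be sketched in Python as follows. This is not the Appendix B program; it is an illustration with hypothetical names, 0-based task numbers, and the assumption that tasks are numbered so that every link goes from a lower to a higher number.

```python
NEG_INF = float("-inf")


def build_matrices(n, links):
    """A[i][j] = 1 for a link i -> j; B[j][i] = 0 for the same link."""
    A = [[0] * n for _ in range(n)]
    B = [[NEG_INF] * n for _ in range(n)]
    for i, j in links:
        A[i][j] = 1
        B[j][i] = 0
    return A, B


def levels_from_B(B, times):
    """Evaluate the level recurrence for j = n-1 down to 0."""
    n = len(times)
    level = [0] * n
    for j in range(n - 1, -1, -1):
        best = max((level[k] + B[k][j] for k in range(n) if k != j),
                   default=NEG_INF)
        # A task with no successor (exit node) has level equal to its time.
        level[j] = times[j] + (best if best != NEG_INF else 0)
    return level


links = [(0, 1), (0, 2), (1, 3), (2, 3)]
times = [2, 3, 1, 4]
A, B = build_matrices(4, links)
level = levels_from_B(B, times)
print(level)
```

The double loop over j and k is what gives the quadratic cost of the level calculation.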

Then the 'main' program does the actual allocation job. Before 
allocating a task to a processor, it checks whether the predecessors 
have finished execution and whether the processor is free. Note that 
processors are chosen in the ascending order of the processor array, 
because uniform inter-processor communication time is assumed. In the 
case of nonuniform inter-processor communication, this part of the
program can be modified so that a processor is chosen to minimize the
communication time.


4.3 DISCUSSION OF THE RESULTS

The following three parameters are computed: 

1) Total execution time. 

2) Algorithm execution factor (AEF) defined as the ratio of the 
serial and parallel times. 

3) Hardware utilization factor (HUF) defined as the ratio of the 
AEF and the number of processors. 
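The two ratios are simple to compute. The sketch below uses hypothetical function names and, as inputs, the rounded serial and parallel times quoted in the discussion (6500 and 300 time units).

```python
def aef(serial_time, parallel_time):
    """Algorithm execution factor: serial time over parallel time."""
    return serial_time / parallel_time


def huf(serial_time, parallel_time, processors):
    """Hardware utilization factor: AEF divided by processor count."""
    return aef(serial_time, parallel_time) / processors


speedup = aef(6500, 300)
print(round(speedup, 2))
```

Because the AEF can never exceed the number of processors, the HUF always lies between 0 and 1. The inputs used here are rounded figures from the prose, so a HUF computed from them need not reproduce the reported percentage exactly; the values plotted in the report's figures should be preferred.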

It is found that the critical path of the task graph takes 300 time
units, meaning that with the given task graph the minimum total
execution time is 300 time units. As shown in Fig. 4.2, the serial
execution time is 6500 units. The total execution time progressively
decreases as the number of processors increases. The critical time is
obtained with 29 processors, and an optimum schedule is achieved.
Beyond this point the increase in the number of processors has no effect
on the execution time.

Fig. 4.3 shows the variation of the AEF with the number of
processors. Note that the maximum AEF cannot exceed the total number of
processors. The results show that the AEF almost takes its maximum value
in each case and is 21.67 with 29 processors.

Fig. 4.4 is a plot of the HUF against the number of processors. A HUF
of 100% means that the processors are fully utilized. It is observed
that the HUF decreases with an increase in the number of processors.
With 29 processors a hardware utilization factor of over 72% is
achieved.

It is concluded that for the given task graph the optimum schedule
is achieved with 29 processors. At this point the speedup compared to
sequential execution reaches its maximum possible value of 21.67 and the
hardware utilization is as high as 72.10%. These results show the
validity of the parallel approach and also justify the use of such an
approach. The restructured method is much superior to the sequential
algorithm and promises a substantial improvement in system performance.

Figure 4.2 Total Execution Time vs. Number of Processors

Figure 4.3 Algorithm Execution Factor vs. Number of Processors

Figure 4.4 Hardware Utilization Factor vs. Number of Processors


Chapter 5 

HARDWARE AND SOFTWARE ASPECTS 


The design of a suitable multiprocessor computer, which optimally 
executes the various independent tasks, is discussed in this chapter. 

The parallelism analysis of the restructured algorithm assumes a 
multiprocessor environment with uniform interprocessor communication 
times and no hardware conflicts. As shown, the algorithm is optimally 
executed with 29 processors, providing the commanded acceleration in one 
direction. This chapter outlines the proposed computer architecture 
which is customized for the specific application. 


5.1 PITFALLS OF VON NEUMANN MULTIPROCESSING 

Most existing multiprocessors are variations of the von Neumann 
model of computation and have so far failed to yield any substantial 
benefits over single processor systems. As discussed by Arvind and
Iannucci [25], there are several problems confronting the von Neumann
style of multiprocessing.

The first problem is that of memory latency, the time between 
issuing a memory request and getting a response. If the computer 
contains a significant number of processors, and each is fast enough 
that its cycle time is limited by the speed of light, then the physical 
size of the whole computer will make most of the memory a significant 
distance away from any one processor. That is, several instruction times 
are needed to access most of the memory, if it is to be shared. 
Competition by several processors for the same memory at the same time 
makes the problem more severe. In trying to solve this problem, many
architects use only messages, prohibiting shared memory entirely and
surrendering flexibility and responsiveness. Others allow shared memory,
but make local copies of the data in caches. This solution exchanges the
latency problem for the cache coherence problem, i.e., how to maintain
consistency when one or more of the copies is written.

A second problem is that of effective synchronization. Parallel
processes must be able to wait for others without having to execute many
extra instructions or waste time in other ways, and without
significantly affecting the other processes that are running in parallel
and not waiting. The use of traditional methods like interrupts limits
the synchronization rate to once every few hundred instructions.
Primitives like test-and-set, which wait busily and thereby avoid
exchanging processes, are better, but these approaches usually waste
instructions to accomplish waiting.

A third problem is the avoidance of bottlenecks which inhibit the
amount of parallelism that can be attained, thereby limiting the number
of processors that is practical. Changing an architecture, especially
the instruction set, to correct bottlenecks in parallelism is
inefficient because it destroys software compatibility.

5.2 DATA-DRIVEN PRINCIPLES 

The solution of the control problem necessitates an efficient and 
fast way of handling the movement of large amounts of data among various 
processors. This makes the data-driven mode of computation an ideal 
candidate. Moreover, the problems associated with the von Neumann style
of multiprocessing are avoided at the very basic level in the
data-driven computer.

Instruction execution in a conventional von Neumann computer is
under the control of the program counter, whereas the data-driven model
of computation is based on the following two principles:

1) Any computation can proceed as soon as its operands become
   available. Potentially all operations that are thus enabled
   can execute simultaneously.

2) All operations are free of side-effects, so that two enabled
   operations can execute in either order, or concurrently,
   without error.
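These two principles can be demonstrated with a very small interpreter. The sketch below is an illustration only, not the proposed processing element: the node table format, the function name and the random choice among enabled operations are assumptions. It fires any operation whose operands are present and shows that the result does not depend on the firing order.

```python
import random


def run_dataflow(nodes, inputs, seed=None):
    """nodes: name -> (function, operand names); inputs: name -> value."""
    rng = random.Random(seed)
    values = dict(inputs)
    pending = dict(nodes)
    while pending:
        # Principle 1: an operation is enabled once all operands exist.
        enabled = [name for name, (_, ops) in pending.items()
                   if all(o in values for o in ops)]
        if not enabled:
            raise ValueError("some operands never become available")
        # Principle 2: any enabled operation may fire; order is irrelevant.
        chosen = rng.choice(enabled)
        fn, ops = pending.pop(chosen)
        values[chosen] = fn(*(values[o] for o in ops))
    return values


nodes = {
    "sum": (lambda a, b: a + b, ["x", "y"]),
    "diff": (lambda a, b: a - b, ["x", "y"]),
    "prod": (lambda s, d: s * d, ["sum", "diff"]),
}
results = {run_dataflow(nodes, {"x": 5, "y": 3}, seed=s)["prod"]
           for s in range(10)}
print(results)
```

"sum" and "diff" are enabled at the same time and could run on different processors; "prod" becomes enabled only after both have produced their values.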

If a program has a sufficient amount of parallelism, then a
data-driven processor can be kept fully utilized. In the previous
chapters it is shown that there is an enormous amount of parallelism
inherent in the avionics application. As discussed later, an execution
unit in the proposed data-driven processing element receives enabled
instructions only, and waiting for operands is done in a separate
section. A data-driven processor, unlike a processor with a program
counter, executes a stream of enabled instructions in a highly pipelined
manner and allows greater freedom in the order of execution of the
enabled instructions.

Data-driven architectures are usually classified as either static or 
dynamic. In a static architecture the nodes of a program graph are 
loaded into memory before computation begins, and, at most, one instance 
of a node at a time is enabled for firing. A dynamic architecture 
facilitates the simultaneous firing of several instances of a node, and 
these can be created at runtime. The architecture proposed in this 
chapter is of the latter type. The parallelism analyses, in the previous 
chapters, are based on a single iteration of the integration of the 
fourteen differential equations, but there are obvious concurrencies 
between the various iterations. This architecture has the provision of
unfolding the integration loop at runtime by creating multiple instances
of the loop body and then executing these instances concurrently.


5.3 DIFFERENT SOFTWARE AND HARDWARE STRUCTURES


In this section the salient software and hardware aspects of the
proposed data-driven computer are outlined. The previous chapters
emphasized the software aspects which are accomplished in the
preprocessing stage. The actual machine to execute the tasks constitutes
the hardware aspect.


5.3.1 Language Considerations 

The entire process of defining the algorithm, removing dependencies,
constructing a task graph, and finally allocating the tasks to the
various processors must be completed before the actual execution starts.
This process is deliberately kept language-independent to gain
flexibility. Since the architecture is data-driven, it is more advantageous to
use a functional language than a conventional imperative language as the 
high-level language to represent the problem. There is a difference 
between the high-level language required to represent the problem and 
the base language which is efficiently implemented by the architecture. 
The high-level language should satisfy the following properties: 

1) Freedom from side-effects: This is necessary to ensure that
   data dependencies are the same as the sequencing constraints.
   Global variables are not allowed and procedures cannot modify
   variables in the calling program. Updating a variable results
   in the creation of new variables.

2) Locality of effect: To avoid memory overflow, variables
   should have a definite region of operation or scope. This
   also avoids the apparent dependencies that result from
   duplication of labels.

3) Equivalence of instruction scheduling constraints with data
   dependencies: This means that all the information needed to
   execute a program is contained in the task graph.

4) Single assignment: This means that each variable may appear
   on the left side of only one assignment statement in the
   part of the program in which it is active.

There are three main categories of programming languages,
functional, actor, and logic, that are suitable for data-driven
computations. Functional languages can either be single-assignment, like ID,
VAL, VALID, and LUCID, or applicative, like pure LISP, SASL, and FP. 
Actor languages are programming systems composed of objects that 
interact only by sending and receiving messages. SMALLTALK is an actor 
language. Logic languages are based on symbolic logic and PROLOG is an 
example. Any one of these languages can be chosen as a high-level 
language for this architecture. 

The base language of this computer is the graphical representation 
termed the task graph, discussed in Chapter 3. The machine efficiently 
executes the tasks shown in the task graph, satisfying the precedence 
constraints of the graph. A task can be executed as soon as all its 
inputs are available. 

5.3.2 Tagged Tokens 

In a manner similar to Arvind [26] and the Manchester Dataflow 
machine [27], information is carried by tokens that flow along the arcs 
of the task graph. A task is enabled when, and only when, all of its 
input tokens are present. An enabled task fires by absorbing its input 
tokens and producing output tokens that carry the result as their value. 
The order of execution is unimportant since there are no races.


5.3.3 The Overall Architecture


The block diagram of the overall architecture is shown in Fig. 5.1. 
There are three basic stages, the preprocessing stage, the execution 
stage, and the output stage. In the preprocessing stage a conventional 
host computer gathers the various input data and coordinates the overall 
activities. An important part of this stage is the allocation unit which 
allocates the tasks to the various processors in an optimal fashion. 
Following the allocation phase, the data and instruction tokens are 
downloaded onto the individual memories of the processing elements 
(PEs). 

At the heart of the architecture lies the data-driven execution
stage, or PE array. There are two clusters of 64 PEs each, which are
connected in a star configuration. As noted in Chapter 4, the optimum
execution of the algorithm requires 29 PEs. In each cluster, 60
'workhorse' PEs are used for computation purposes only, and the
remaining 4 PEs are dedicated to various purposes. One dedicated PE is
the communication link between the preprocessing stage and the cluster.
The second dedicated PE is reserved for diagnostic purposes. This
diagnostic processor helps in recovery from faults and in reassignment
of PEs. The third is reserved for inter-cluster communication, and the
remaining one serves as a link between the cluster and the output stage.
At any given time, only 29 workhorse PEs function within a cluster. The
others remain in standby and are used when a PE must be aborted after a
fault is detected.


Fig. 5.1 Block Diagram of the Overall Architecture

Two clusters are required because an identical number of calculations
must be performed to compute the acceleration of the vehicle in
the transverse direction. The efficiency of the architecture results
from the fast movement of data through the execution stage, which is
devoid of the conventional von Neumann bottlenecks.

The output stage receives the final result tokens from the execution 
stage via the dedicated PEs in each cluster. These tokens are converted 
into a form that serves as an input to the subsequent motion control
actuators.

5.3.4 Network Topology of the Execution Stage 

Among the possible configurations are the ring, tree, completely 
connected, and the star topologies. In the ring network N processors are
connected on a circular bus. Only 1/N of the bus bandwidth is available
to each processor, and failure of a single node or path within the ring 
may halt communication in the entire ring. To alleviate this problem, 
designers have constructed partially and completely connected rings at 
the expense of increased network complexity and cost. Also, the number 
of nodes in a ring is limited because message delays increase linearly 
with the number of nodes, making a ring inefficient for heavy traffic. 

A tree network uses the minimal number of connections per processor. 
Communication between remote leaves faces a bottleneck towards the top 
of the tree, and the data paths become longer as the number of nodes 
increases. Hence a tree is also unsuitable for heavy communication. The 
completely connected network requires on the order of $N^2$ connection
links for N processors, which is prohibitively expensive.

The star network has both logical and hardware simplicity. If the
tasks are uniformly distributed, most messages traverse only two
communication paths. A major vulnerability of this topology is the
active hub which not only introduces queuing delay but also disables
the entire network once it fails. For this reason a passive hub, which
is nothing but a physical connection of the various paths, is used.
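The cost argument can be made concrete by counting links for each topology. The sketch below is a rough comparison under stated assumptions: links are bidirectional point-to-point connections, the tree is any tree spanning the N processors, and the star's hub is not itself counted as a processor. The function name is hypothetical.

```python
def link_counts(n):
    """Number of links needed to connect n processors in each topology."""
    return {
        "ring": n,                     # each node linked to its neighbour
        "tree": n - 1,                 # minimum for a connected network
        "complete": n * (n - 1) // 2,  # one link per pair, grows as n^2
        "star": n,                     # one link from each node to the hub
    }


counts = link_counts(64)
print(counts)
```

For a 64-PE cluster the completely connected network needs 2016 links against 64 for the star, and in the star any message reaches its destination over two links (source to hub, hub to destination).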

The result token from every PE is broadcast to all PEs in the 
cluster. The result token contains a field denoting the destination 
address, and all PEs decode this field to find a match. 

5.3.5 Crossbar Switching 

An alternative to broadcast communication is the use of crossbar 
switch networks. Communication is inherently sequential if the message 
is broadcast to all the PEs. A high-speed crossbar switch can be used at 
the hub of the star for routing the information to the appropriate 
processor. The crossbar switch gives the cluster the capability of 
parallel communication between pairs of processors. Extremely fast 
switches are available which make the switching time negligible compared 
to token formatting and communication times. 

An n by m crossbar switch is a device with n inlets, m outlets, and
an array of n*m contacts, sometimes called crosspoints, for connecting
each inlet to each outlet. A crossbar network is an interconnection of
crossbar switches in accordance with certain rules. The switches must be
partitioned into a number of classes called stages in such a way that
all switches in a given stage have the same number of inlets and
outlets. The inlets of the switches in the first stage are the inputs of
the network. The outlets of the switches in each stage except the last
are connected in a one-to-one fashion to the inlets of the switches in
the following stage by links. Finally, the outlets of the switches in the
last stage are the outputs of the network.

Delta networks are multiple-stage networks with each stage consisting 
of several crossbar switches, as shown in Fig. 5.2. The switches in 
a buffered delta network have buffers to temporarily store tokens which 
cannot be forwarded in the current cycle. An N by N buffered delta 
network can be constructed from B by B crossbar switches, each capable 
of forwarding a token that arrives at any of its B inputs to any of its 
B outputs (see Fig. 5.2, where $N = 8$ and $B = 2$). The network has $n/b$ 
stages (numbered $1, 2, \ldots, n/b$, where $n = \log_2 N$ and $b = \log_2 B$), 
and each stage has $2^{n-b}$ crossbar switches. 
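
The sizing rule above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `delta_network_shape` is hypothetical, and it assumes N and B are powers of two with log2(B) dividing log2(N), as the stage count n/b implies.

```python
def delta_network_shape(N: int, B: int) -> tuple[int, int]:
    """Return (number of stages, switches per stage) for an N x N delta
    network built from B x B crossbar switches."""
    n = N.bit_length() - 1  # n = log2(N) when N is a power of two
    b = B.bit_length() - 1  # b = log2(B) when B is a power of two
    if N < 2 or B < 2 or (1 << n) != N or (1 << b) != B or n % b != 0:
        raise ValueError("N and B must be powers of two, with log2(B) dividing log2(N)")
    return n // b, 2 ** (n - b)


# The configuration of Fig. 5.2 (N = 8, B = 2): three stages of four switches.
print(delta_network_shape(8, 2))
```

For N = 8 and B = 2 this gives three stages of four 2 by 2 switches each.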

For each of its output ports, a switch selects one token from the 
set of tokens contending for that port and offers it to the next stage 
connected to that port. The output port through which a token leaves the 
switch is determined by the switch from a destination address included 
in the token. Generally, the output ports requested by the tokens at the 
heads of the switch buffers are considered. For each of these output 
ports, one of the requesting tokens is selected equiprobably and offered 
to a switch in the next connected stage. The switches with input tokens 
forward them to the intended buffers if these buffers have a vacancy at 
the beginning of the clock cycle. For each accepted token an acknowledge 
signal is sent to the switch from which the token came. 
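
The text states only that each switch derives the output port from the destination address carried in the token. One common way to do this in delta networks is digit-controlled routing, in which the switch at stage s uses the s-th base-B digit of the destination. The sketch below assumes that scheme, with the most significant digit used first; the function name `route` and the digit order are assumptions, not details taken from this report.

```python
def route(dest: int, N: int, B: int) -> list[int]:
    """Output port selected at each stage for a token addressed to `dest`
    in an N x N delta network of B x B switches (assumed digit-controlled)."""
    stages, size = 0, 1
    while size < N:          # count the base-B digits needed to address N outputs
        size *= B
        stages += 1
    if size != N or not 0 <= dest < N:
        raise ValueError("N must be a power of B and dest must lie in 0..N-1")
    ports = []
    for s in range(stages):
        shift = stages - 1 - s                  # most significant digit first
        ports.append((dest // B ** shift) % B)  # s-th base-B digit of dest
    return ports


print(route(5, 8, 2))   # binary 101
```

Under these assumptions a token addressed to output 5 of an 8 by 8 network of 2 by 2 switches leaves through ports 1, 0, 1 in successive stages.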

Three major factors which influence the performance of a buffered 
delta network are the size of the switches, the size of the buffers used 
in each switch, and their position with respect to the switch. It is 
shown in [28] that for small buffer sizes, delta networks constructed 
with 4 by 4 switches provide slightly better throughput and substantially 
lower delay than 2 by 2 switches. However, for large buffer sizes 
the delta networks constructed from 2 by 2 switches provide better 
throughput at the expense of larger delays. 

Tokens are blocked when either more than one token in the switch 
contends for the same output port (switch blocking) or there is 
insufficient buffer capacity (buffer blocking). In networks with large 
buffer sizes, buffer blocking is 
minimized and the degradation of throughput is primarily due to switch 
blocking. Since switch blocking increases with the size of the crossbar 
switch, it is advantageous to use 2 by 2 switches instead of 4 by 4 
ones. The buffers can be provided at the input links of each switch, as 
shown in Fig. 5.3a, or they can be inside the switch as shown in Fig. 
5.3b. Kumar and Jump [28] have deduced that with large buffer sizes it 
is advantageous to use buffers inside the switches, in terms of both 
throughput and delay. 

In a crossbar switch several tokens may simultaneously request the 
same output port, so various priority schemes can be used for selection. 
The simplest method is to select one of the tokens randomly. Another is 
the rotating priority scheme in which all buffers in the switch are 
assigned a permanent cyclic order. In each clock cycle all the buffers 
are considered in this order, starting from a designated high-priority 
buffer. The buffer adjacent to the high-priority one in the current 
cycle becomes the high-priority buffer in the next cycle. In another 
scheme, a token in any buffer is considered to have a priority equal to 
the number of tokens in that buffer. For tokens with the same priority 
measure, one is selected equiprobably. From the results shown in [28], 
the performance of the random selection scheme is found to be similar to 
that of the other two schemes, and it is the easiest one to implement. 

Fig. 5.2 A Buffered Delta Network 
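
The rotating priority scheme described above can be sketched for a single output port as follows. The class name `RotatingArbiter` and the representation of requests as a set of buffer indices are illustrative assumptions.

```python
class RotatingArbiter:
    """Rotating-priority selection among the buffers of one switch."""

    def __init__(self, num_buffers: int):
        self.n = num_buffers
        self.high = 0  # index of the current high-priority buffer

    def select(self, requesting: set[int]):
        """Pick one requesting buffer for this clock cycle, or None.

        Buffers are scanned in their fixed cyclic order starting from the
        high-priority buffer; the neighbouring buffer then becomes the
        high-priority one for the next cycle.
        """
        chosen = None
        for offset in range(self.n):
            idx = (self.high + offset) % self.n
            if idx in requesting:
                chosen = idx
                break
        self.high = (self.high + 1) % self.n
        return chosen


arb = RotatingArbiter(4)
for cycle in range(3):
    print(cycle, arb.select({1, 3}))
```

With buffers 1 and 3 requesting in every cycle, the grant moves from buffer 1 to buffer 3 once the high-priority position has rotated past buffer 1.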
Fig. 5.3a Crossbar Switch with buffers at the input links 






5.3.6 Token Format 


As mentioned earlier, information is communicated between PEs via 
tokens or packets. There are two types of tokens: the instruction token, 
containing the information required for a node execution, and the data 
token, carrying data required to enable the node. A 64-bit token size 
accommodates a 32-bit data or operation field and 32 bits for the control 
directives. The control portion of the token includes the following 
subfields: 

i) Check field: This determines whether it is an instruction 
token or a data token. 

ii) Module field: This identifies the block of code (procedure 
or loop) to which the token belongs. 

iii) Instruction number field: This identifies the instruction 
number within a specific block. 

iv) Processor field: This denotes the processor responsible 
for executing the code. 

v) Error check field: This contains information for checking 
the validity of the token. Error detection and correction 
(EDAC) codes like cyclic redundancy check (CRC) and Hamming 
codes can be used. 

vi) Data counter field: This is a part of the data token only. 
It indicates the number of operands required to enable a 
node. 

The remaining 32 bits of the token contain the data value or the 
instruction code, as the case may be. From the software point of view, a 
longer token is better since it can carry more information. On the other 
hand, a shorter token is better from the hardware point of view, because 
it reduces the amount of hardware and the number of network connections. 
Hence, the token size should be chosen to balance these two requirements. 
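
As an illustration of the 64-bit layout, the sketch below packs the six control subfields into the upper 32 bits and the data value or instruction code into the lower 32 bits. The report does not give the individual subfield widths or their order; the widths used here (1, 6, 10, 5, 8, and 2 bits) and the names `FIELDS`, `pack_token`, and `unpack_token` are assumptions chosen only so that the control part totals 32 bits.

```python
# (name, width in bits); widths are hypothetical and sum to 32.
FIELDS = [
    ("check", 1),         # instruction token or data token
    ("module", 6),        # block of code the token belongs to
    ("instr", 10),        # instruction number within the block
    ("processor", 5),     # processor responsible for execution
    ("error_check", 8),   # EDAC information
    ("data_counter", 2),  # operand count (data tokens only)
]


def pack_token(payload: int, **control: int) -> int:
    """Build a 64-bit token: 32 control bits followed by a 32-bit payload."""
    if not 0 <= payload < (1 << 32):
        raise ValueError("payload must fit in 32 bits")
    word = 0
    for name, width in FIELDS:
        value = control.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return (word << 32) | payload


def unpack_token(token: int) -> tuple[int, dict[str, int]]:
    """Split a 64-bit token back into its payload and control subfields."""
    payload = token & 0xFFFFFFFF
    word = token >> 32
    control = {}
    for name, width in reversed(FIELDS):
        control[name] = word & ((1 << width) - 1)
        word >>= width
    return payload, control


tok = pack_token(1234, check=1, module=3, instr=17, processor=2, data_counter=2)
print(hex(tok), unpack_token(tok))
```

A wider token would leave room for larger module or instruction fields at the cost of wider links, which is the trade-off discussed above.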

5.3.7 Design of the Processing Element 

The processing element (PE) is designed to meet some specific 
requirements. Every token has a tag or control field, and the PE has the 
ability to decode the tag and route the token to the separate components 
of the PE. The PE provides local storage for instruction and data 
tokens. A content-addressable memory (CAM) is simulated using a 
hardware hashing technique: a hash key generated from the tag addresses 
a hash table, which gives a very high search speed. The PE provides 
circuitry for EDAC decoding and encoding, and has the ability to detect, 
isolate, and rectify faults. Most of the units are self-checking. 

A block diagram of the proposed PE design is shown in Fig. 5.4. It 
consists of an input queue, an EDAC decode unit, a wait-match-store 
unit, an execution unit, an output unit, and the overflow and 
intermediate buffers. 

The input queue is a FIFO buffer, receiving tokens from other PEs 
and sending them to the EDAC decode unit. It works as a rate-balancing 
mechanism, attempting to even out the rates of token production and 
consumption. It therefore allows the wait-match-store unit and the 
execution unit to work concurrently. 

The EDAC decode unit checks the error code of the token. If a 
correctable fault is detected, it passes the rectified token to the 
wait-match-store unit. If the fault cannot be rectified, it informs the 
diagnostic unit. 

The wait-match-store unit consists of a code memory to store the 
instructions and an operand memory for data. After a token is received, 
the hashing mechanism generates a hash key to address the hash table, 
called the Operand Block Table (OBT). Each entry of the operand memory 
consists of 34 bits: a 2-bit Operand Enable Flag (OEF) 
and a 32-bit data value. The OEF shows the existence of the operand. 
It is set to 00 if no token has arrived. When a data token arrives, the 
OEF is set to 10 or 01, depending on whether it is the left or right 
operand (denoted by the data counter field), and the data is stored in 
the data area. 

After a token is received, the unit starts accessing both the 
operand and code memories simultaneously. If the instruction is monadic, 
the operand memory is not searched, and the executable packet is 
immediately generated. For a dyadic instruction the operand memory is 
searched, and if the matched token is found, the executable packet is 
generated, and the OEF is set to 00. The executable packet is passed 
on to the execution unit. 
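
The matching behaviour described above can be sketched in software, with a Python dictionary standing in for the hardware hash table (OBT). The class name `WaitMatchStore`, the tag format, and the tuple form of the executable packet are illustrative assumptions; only the OEF encoding (00, 10, 01) is taken from the text.

```python
LEFT, RIGHT, EMPTY = 0b10, 0b01, 0b00  # OEF values from the text


class WaitMatchStore:
    def __init__(self, code_memory: dict):
        # code_memory maps a tag to (opcode, arity); a hypothetical layout.
        self.code = code_memory
        self.obt = {}  # tag -> [OEF, stored operand]

    def receive(self, tag, side: int, value):
        """Process one data token; return an executable packet or None."""
        opcode, arity = self.code[tag]
        if arity == 1:                     # monadic: no operand search needed
            return (opcode, value)
        entry = self.obt.setdefault(tag, [EMPTY, None])
        if entry[0] == EMPTY:              # first operand: store it and wait
            entry[0], entry[1] = side, value
            return None
        stored_side, stored_value = entry
        if stored_side == side:
            raise ValueError("same operand received twice")
        entry[0], entry[1] = EMPTY, None   # match found: reset the OEF to 00
        if stored_side == LEFT:
            return (opcode, stored_value, value)
        return (opcode, value, stored_value)


wms = WaitMatchStore({("loop1", 7): ("SUB", 2), ("loop1", 8): ("NEG", 1)})
print(wms.receive(("loop1", 7), RIGHT, 3))   # waits for the left operand
print(wms.receive(("loop1", 7), LEFT, 10))   # match: executable packet
```

The dictionary lookup plays the role of the hash-key access; overflow handling and the intermediate buffer are left out of this sketch.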

The execution unit performs all arithmetic and logic instructions. 
Some commonly used instructions can be hardwired to enhance the speed of 
execution. The result tokens are forwarded to the output unit. 

The output unit generates the tag field of the result token. The 
token is properly formatted and the EDAC code is embedded in it. Since 
it is possible to encounter delay while transmitting a token through the 
communication network, a buffer is also provided in the output unit. 

The overflow memory is provided to augment the operand memory. An 
overflow occurs when all locations in the operand memory are occupied. 
Then the unmatched incoming token is stored in the overflow buffer, and 
indicator flags are set up to notify subsequent tokens. The intermediate 
buffer stores the matched tokens of an enabled instruction, so that the 
tokens can be retrieved when a fault is detected after execution. 
Fig. 5.4 Block Diagram of the proposed PE 

The diagnostic unit periodically checks the major components of the 
PE. These tests can be executed either in parallel with normal 
execution or when the tested unit is inactive, to avoid any degradation 
of system performance. Once a fault is detected, the diagnostic unit 
aborts the PE and informs the diagnostic processor, which in turn 
initiates the reassignment process. The diagnostic unit can prevent 
the output unit from transmitting a faulty token, and thus can localize 
the fault. 

5.3.8 Fault Tolerance 

Fault tolerance is an important requirement of any multiprocessor 
architecture. In the proposed architecture both hardware and software 
fault tolerance are used to improve system reliability. The workhorse 
PEs within a cluster are duplicated to provide hardware redundancy, and 
are built with extremely reliable components. The communication network 
is very rugged and reliable. Fine-grained fault tolerance is provided 
within every PE with the help of self-checking circuitry and diagnostic 
units. 

Software fault tolerance is provided by watchdog timers and EDAC 
codes. A watchdog timer is a simple and inexpensive way of monitoring 
proper process functions. A timer is maintained as a process separate 
from the one it checks, and is set as soon as the process starts. The 
process resets the timer after successful completion. If the 
timer is not reset, then a process failure is assumed. 
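
The watchdog behaviour can be illustrated with a small sketch. To keep it deterministic, the current time is passed in explicitly instead of being read from a clock; the class name `Watchdog` and the notion of a fixed timeout are assumptions made for the example.

```python
class Watchdog:
    """Timer kept separately from the process it monitors."""

    def __init__(self, timeout: int):
        self.timeout = timeout
        self.deadline = None  # None means the timer is not running

    def start(self, now: int) -> None:
        """Set the timer as soon as the monitored process starts."""
        self.deadline = now + self.timeout

    def reset(self) -> None:
        """Called by the monitored process after successful completion."""
        self.deadline = None

    def expired(self, now: int) -> bool:
        """True if the timer was set and not reset before its deadline."""
        return self.deadline is not None and now > self.deadline


wd = Watchdog(timeout=5)
wd.start(now=0)
wd.reset()                      # process finished in time
print(wd.expired(now=100))      # no failure assumed
wd.start(now=10)
print(wd.expired(now=16))       # timer never reset: failure assumed
```

In the second run the process never resets the timer, so the check at time 16 reports a failure.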

Transmission faults are easily detected and corrected using various 
codes. A simple example is the Hamming code, which provides single error 
correction and, when extended with an overall parity bit, double error 
detection. For an 8-bit data word the encoder generates a 12-bit word 

$$H_1\; H_2\; D_3\; H_4\; D_5\; D_6\; D_7\; H_8\; D_9\; D_{10}\; D_{11}\; D_{12}$$

where the H's are the check bits and the D's are the data bits. 

The following are called the syndrome equations: 

$$S_1 = H_1 + D_3 + D_5 + D_7 + D_9 + D_{11}$$

$$S_2 = H_2 + D_3 + D_6 + D_7 + D_{10} + D_{11}$$

$$S_4 = H_4 + D_5 + D_6 + D_7 + D_{12}$$

$$S_8 = H_8 + D_9 + D_{10} + D_{11} + D_{12}$$

where '+' denotes an exclusive OR operation. While generating the code, 
the check bits are produced by setting the syndrome bits to zero. 
During the checking process, the syndrome bits are recomputed. If any of 
the resulting syndrome bits is nonzero, then a detectable error has 
occurred. An error can be corrected provided only one bit is erroneous. 
The binary number $S_8 S_4 S_2 S_1$ gives the position of the erroneous bit. For 
example, if $D_{12}$ is changed, then $S_8$ and $S_4$ are nonzero and 
$S_8 S_4 S_2 S_1 = 1100$, i.e., 12 in decimal. This code can be extended to 
accommodate longer data words. 
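
The encoding and checking steps can be traced with a short sketch of the 12-bit code. Bit positions follow the word layout above (check bits at positions 1, 2, 4, 8); the function names `encode`, `syndrome`, and `correct` are illustrative, and only single-bit errors are handled.

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # positions of the data bits
CHECK_POS = [1, 2, 4, 8]                 # positions of the check bits


def encode(data_bits: list[int]) -> list[int]:
    """Return the 12-bit word as a list indexed 1..12 (index 0 unused)."""
    word = [0] * 13
    for pos, bit in zip(DATA_POS, data_bits):
        word[pos] = bit
    for k in CHECK_POS:
        parity = 0
        for pos in range(1, 13):
            if pos & k and pos != k:
                parity ^= word[pos]
        word[k] = parity                 # chosen so that syndrome bit S_k is zero
    return word


def syndrome(word: list[int]) -> int:
    """Return the syndrome as the integer with bits S8 S4 S2 S1."""
    s = 0
    for k in CHECK_POS:
        parity = 0
        for pos in range(1, 13):
            if pos & k:                  # positions that appear in the equation for S_k
                parity ^= word[pos]
        if parity:
            s |= k
    return s


def correct(word: list[int]) -> tuple[list[int], int]:
    """Flip the bit named by the syndrome, assuming at most one bit is wrong."""
    s = syndrome(word)
    fixed = list(word)
    if 1 <= s <= 12:
        fixed[s] ^= 1
    return fixed, s


sent = encode([1, 0, 1, 1, 0, 0, 1, 0])
received = list(sent)
received[12] ^= 1                        # corrupt D12, as in the example above
repaired, s = correct(received)
print(s, repaired == sent)
```

Flipping D12 yields a syndrome of 12 (binary 1100), matching the worked example, and the correction step restores the transmitted word.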

5.3.9 Stacked Hybrid WSI Technology 

The entire architecture must be housed within a small package. Hence 
the dimensions, weight, and cost of the hardware are important 
considerations. Hybrid Wafer Scale Integration (WSI) is a possible solution. 
This technology involves scribing the wafer after fabrication. The 
actual PEs are then separated and remounted in preassigned positions 
onto a substrate of polyimide. The inter-processor links are fabricated 
by ion-implantation techniques. Hybrid WSI partially eliminates the two 
major problems of traditional WSI, viz., yield and power dissipation. 
Since each PE is scribed and then separated, partial testing may be 
performed prior to remounting. Power is more easily dissipated through 
the polyimide substrate. 

Multi-stack wafers can be used to house the two clusters and other 
units. The 3D Computer Studies Department at the Hughes Research 
Laboratories, Malibu, California has built an image processing cellular 
array of stacked CMOS wafers with feedthroughs and interconnects. A 
significant advantage of such a scheme is the upgradeability of the 
architecture as additional features are accommodated by introducing more 
wafer stacks. 

Chapter 6 

CONCLUSIONS AND RECOMMENDATIONS 


The previous chapters have discussed the need, development, 
utilization, and validity of this research. Experimental results have 
been discussed in Chapter 4. The purpose of this chapter is to express 
some general conclusions and recommendations. 

6.1 CONCLUSIONS 

This research has proved that acceptable results can be obtained by 
using parallel processing in real-time systems. It has shown that 
enhancement of avionics design and vehicle control is possible by 
computing the guidance commands in real-time, exploiting the parallelism 
inherent in the problem. 

There are various ways of applying parallel processing techniques to 
meet the need for rapid, real-time computation. One of these 
approaches, outlined in this report, comprises the following 
two major phases: 

1.) The sequential algorithm is suitably restructured by removing 
    dependencies, identifying concurrent tasks, exploiting 
    optimum parallelism, and optimally allocating the tasks to 
    available resources. 

2.) Appropriate hardware structures are designed to implement 
    the parallel or modified algorithm. 

Together, the above two phases constitute an innovative, customized 
computer architecture for the algorithmic execution of a real-time 
system. The data-driven mode of computation is ideally suited for the 
real-time solution of control processing, avoiding the bottlenecks of 
von Neumann multiprocessing. 

This research has also demonstrated the significance of the allocation 
process in a parallel processing application. Optimal or near-optimal 
allocation can be achieved with the help of heuristic algorithms. 


6.2 RECOMMENDATIONS 

The scope of this research is not limited to the specific field of 
guidance and control. The authors recommend the use of the techniques 
outlined in this report in similar problem areas in other real-time 
systems. 

The significance of the allocation process has been demonstrated by 
this research. The authors suggest the design of allocation algorithms 
which are themselves suitable for parallel implementation, an example of 
which is the branch and bound technique, discussed in section 3.4.2. 

The effect of using decoupling techniques to reduce dependencies 
between differential equations should be investigated. The performance 
can be further improved by defining tasks with optimum granularity. This 
can be achieved by striking a suitable balance between the execution and 
communication times. The authors also recommend further research in the 
design of hardware structures which are capable of executing specific 
algorithms and graphs. 

Appendix A 
TASK LIST 

DESCRIPTION | NO. | TIME | PREDECESSORS | SUCCESSORS 
Entry | 1 | 0 | None | 2,3,4,5 
a = w_n*w_n | 2 | 30 | 1 | 6,45,46,57,58,69,70,77,78,95,96,105,106 
b = 2*w*w_n | 3 | 50 | 1 | 49,50,63,64,75,76,81,82,101,102 
INC i | 4 | 20 | 1 | 7,8,...,108,109 
A_T | 5 | 20 | 1 | 87,88,91,92,97,98,103,104 
Q = a*a/r | 6 | 60 | 2 | 35,36,37,38,39,40,83,84 
P1^p | 7 | 110 | 4 | 43 
P1^c | 8 | 110 | 4 | 44 
P2^p | 9 | 110 | 4 | 45,55,87 
P2^c | 10 | 110 | 4 | 46,56,88 
P3^p | 11 | 110 | 4 | 49,61 
P3^c | 12 | 110 | 4 | 50,62 
P4^p | 13 | 110 | 4 | 35,41,45,49,67 
P4^c | 14 | 110 | 4 | 36,42,46,50,68 
P5^p | 15 | 110 | 4 | 61,91 
P5^c | 16 | 110 | 4 | 62,92 
P6^p | 17 | 110 | 4 | 67,69,97 
P6^c | 18 | 110 | 4 | 68,70,98 
P7^p | 19 | 110 | 4 | 37,43,53,57,63,77,103 
P7^c | 20 | 110 | 4 | 38,44,54,58,64,78,104 
P8^p | 21 | 110 | 4 | 75 
P8^c | 22 | 110 | 4 | 76 
P9^p | 23 | 110 | 4 | 39,47,59,69,71,75,79,81 
P9^c | 24 | 110 | 4 | 40,48,60,70,72,76,80,82 
P10^p | 25 | 110 | 4 | 51,65,77,81,83,105 
P10^c | 26 | 110 | 4 | 52,66,78,82,84,106 
P11^p | 27 | 110 | 4 | 91 
P11^c | 28 | 110 | 4 | 92 
P12^p | 29 | 110 | 4 | 95 
P12^c | 30 | 110 | 4 | 96 
P13^p | 31 | 110 | 4 | 101 
P13^c | 32 | 110 | 4 | 102 
P14^p | 33 | 110 | 4 | 89,93,95,99,101,107 
P14^c | 34 | 110 | 4 | 90,94,96,100,102,108 
TEMP1^p = Q*P4^p | 35 | 30 | 6,13 | 41,43,47,51,89,93 
TEMP1^c | 36 | 30 | 6,14 | 42,44,48,52,90,94 
TEMP2^p = Q*P7^p | 37 | 30 | 6,19 | 53,59,65 
TEMP2^c | 38 | 30 | 6,20 | 54,60,66 
TEMP3^p = Q*P9^p | 39 | 30 | 6,23 | 71,79,99 
TEMP3^c | 40 | 30 | 6,24 | 72,80,100 
f1^p = P4^p*TEMP1^p | 41 | 30 | 13,35 | 109 
f1^c | 42 | 30 | 14,36 | 109 
f2^p = P1^p - P7^p*TEMP1^p | 43 | 50 | 7,19,35 | 109 
f2^c | 44 | 50 | 8,20,36 | 109 
TEMP4^p = a*P4^p + P2^p | 45 | 50 | 2,9,13 | 47 
TEMP4^c | 46 | 50 | 2,10,14 | 48 
f3^p = TEMP1^p*P9^p + TEMP4^p | 47 | 50 | 23,35,45 | 109 
f3^c | 48 | 50 | 24,36,46 | 109 
TEMP5^p = P3^p + P4^p*b | 49 | 50 | 3,11,13 | 51 
TEMP5^c | 50 | 50 | 3,12,14 | 52 
f4^p = TEMP1^p*P10^p + TEMP5^p | 51 | 50 | 25,35,49 | 109 
f4^c | 52 | 50 | 26,36,50 | 109 
TEMP6^p = TEMP2^p*P7^p | 53 | 30 | 19,37 | 55 
TEMP6^c | 54 | 30 | 20,38 | 56 
f5^p = 2*P2^p - TEMP6^p | 55 | 40 | 9,53 | 109 
f5^c | 56 | 40 | 10,54 | 109 
TEMP7^p = a*P7^p | 57 | 30 | 2,19 | 61 
TEMP7^c | 58 | 30 | 2,20 | 62 
TEMP8^p = TEMP2^p*P9^p | 59 | 30 | 23,37 | 61 
TEMP8^c | 60 | 30 | 24,38 | 62 
f6^p = P3^p + P5^p + TEMP7^p - TEMP8^p | 61 | 60 | 11,15,57,59 | 109 
f6^c | 62 | 60 | 12,16,58,60 | 109 
TEMP9^p = b*P7^p | 63 | 30 | 3,19 | 67 
TEMP9^c | 64 | 30 | 3,20 | 68 
TEMP10^p = TEMP2^p*P10^p | 65 | 30 | 25,37 | 67 
TEMP10^c | 66 | 30 | 26,38 | 68 
f7^p = P6^p + P4^p + TEMP9^p - TEMP10^p | 67 | 60 | 13,17,63,65 | 109 
f7^c | 68 | 60 | 14,18,64,66 | 109 
TEMP11^p = P6^p + a*P9^p | 69 | 50 | 2,17,23 | 73 
TEMP11^c | 70 | 50 | 2,18,24 | 74 
TEMP12^p = TEMP3^p*P9^p | 71 | 30 | 23,39 | 73 
TEMP12^c | 72 | 30 | 24,40 | 74 
f8^p = 2*TEMP11^p - TEMP12^p | 73 | 40 | 69,71 | 109 
f8^c | 74 | 40 | 70,72 | 109 
TEMP13^p = P8^p + b*P9^p | 75 | 50 | 3,21,23 | 79 
TEMP13^c | 76 | 50 | 3,22,24 | 80 
TEMP14^p = P7^p + a*P10^p | 77 | 50 | 2,19,25 | 79 
TEMP14^c | 78 | 50 | 2,20,26 | 80 
f9^p = TEMP13^p + TEMP14^p - TEMP3^p*P9^p | 79 | 70 | 23,39,75,77 | 109 
f9^c | 80 | 70 | 24,40,76,78 | 109 
TEMP15^p = P9^p + b*P10^p | 81 | 50 | 3,23,25 | 85 
TEMP15^c | 82 | 50 | 3,24,26 | 86 
TEMP16^p = Q*P10^p | 83 | 30 | 6,25 | 85 
TEMP16^c | 84 | 30 | 6,26 | 86 
f10^p = 2*TEMP15^p - TEMP16^p | 85 | 40 | 81,83 | 109 
f10^c | 86 | 40 | 82,84 | 109 
TEMP17^p = P2^p*A_T | 87 | 30 | 5,9 | 89 
TEMP17^c | 88 | 30 | 5,10 | 90 
f11^p = TEMP17^p - TEMP1^p*P14^p | 89 | 50 | 33,35,87 | 109 
f11^c | 90 | 50 | 34,36,88 | 109 
TEMP18^p = P11^p - P5^p*A_T | 91 | 50 | 5,15,27 | 93 
TEMP18^c | 92 | 50 | 5,16,28 | 94 
f12^p = TEMP18^p - TEMP1^p*P14^p | 93 | 50 | 33,35,91 | 109 
f12^c | 94 | 50 | 34,36,92 | 109 
TEMP19^p = P12^p + a*P14^p | 95 | 50 | 2,29,33 | 99 
TEMP19^c | 96 | 50 | 2,30,34 | 100 
TEMP20^p = P6^p*A_T | 97 | 30 | 5,17 | 99 
TEMP20^c | 98 | 30 | 5,18 | 100 
f13^p = TEMP19^p - TEMP20^p - TEMP3^p*P14^p | 99 | 70 | 33,39,95,97 | 109 
f13^c | 100 | 70 | 34,40,96,98 | 109 
TEMP21^p = P13^p + b*P14^p | 101 | 50 | 3,31,33 | 107 
TEMP21^c | 102 | 50 | 3,32,34 | 108 
TEMP22^p = P7^p*A_T | 103 | 30 | 5,19 | 107 
TEMP22^c | 104 | 30 | 5,20 | 108 
TEMP23^p = Q*P10^p | 105 | 30 | 2,25 | 107 
TEMP23^c | 106 | 30 | 2,26 | 108 
f14^p = TEMP23^p*P14^p + TEMP21^p - TEMP22^p | 107 | 70 | 33,101,103,105 | 109 
f14^c | 108 | 70 | 34,102,104,106 | 109 
CMP | 109 | 50 | 41...44,47,48,51,52,55,56,61,62,67,68,73,74,79,80,85,86,89,90,93,94,99,100,107,108 | 110 
EXIT | 110 | 0 | 109 | None 

Appendix B 
TASK GRAPH PROGRAM 


{ Author : Arindam Saha    All Rights Reserved by author }
{ Program explained in Section 3.2 also. }
{ Extensive documentation is provided with the program. }
{ This program maps any given task graph onto a group of processors. }
{ It requires the vertices and the edges of the task graph as its }
{ input and provides the schedule, i.e., which task is to be executed }
{ by which processor and at what time. The number of processors is a }
{ variable. The program implements the CP/MISF (explained in chapter }
{ three) algorithm. }

program cpmisf(input, output);
type
  proc = record
    busy : boolean;        { Each processor is either busy or free }
  end;

  { This is the task definition }
  tas = record
    enabled   : boolean;   { Task is enabled when all its predecessors }
                           { have been executed }
    assigned  : boolean;   { Task is assigned to an available processor }
    executed  : boolean;   { Task has finished execution or not }
    time      : integer;   { Execution time of the task }
    resource  : integer;   { The processor number to which it is assigned }
    starttime : integer;   { Time instant at which it starts execution }
  end;

  matrix = array[1..110, 1..110] of boolean;
var
  a, newa : matrix;  { a : adjacency matrix  newa : modified a after renumber }
  pr, pr1 : array[1..110] of real;  { priority lists }
  time, imsucc, newtime : array[1..110] of integer;
  i, j, k, l, v, t, p, serialtime : integer;
  filvar, filvar1 : text;  { input and output data files }
  task : array[1..110] of tas;
  processor : array[1..35] of proc;
  over : boolean;
  speedup, eff : real;  { performance indices }

{ This procedure reads the input data and initializes all the variables }

procedure initialize;
var
  x, y, e, v1, v2 : integer;
begin
  readln(filvar1, v, e);
  for x := 1 to v do
    for y := 1 to v do a[x,y] := false;
  for j := 1 to e do begin
    readln(filvar1, v1, v2);
    a[v1,v2] := true;
  end; {for}
  for i := 7 to 109 do a[4,i] := true;  { This is for the particular graph. }
  for j := 1 to v do begin
    readln(filvar1, time[j], imsucc[j]);
  end; {for}
end; {initialize}

{ This procedure calculates the level (defined in chapter 3) of each task. }

procedure level;
var
  b : array[1..110, 1..110] of integer;
  temp, temp4 : real;
begin
  { Non-edges get -maxint so that they can never lie on a longest path. }
  { The sign was lost in the scanned listing and is restored by inference. }
  for i := 1 to v do
    for j := 1 to v do
      if a[i,j] then b[j,i] := 0 else b[j,i] := -maxint;
  pr[v] := time[v]; pr1[v] := time[v];
  for i := 1 to (v-1) do begin
    temp := 0.0; k := v-i; temp4 := 0.0;
    for j := k to (v-1) do begin
      pr1[k] := pr1[j+1] + b[j+1,k] + time[k];  { pr1 is used to calculate the }
                                                { critical time of the graph. }
      pr[k] := pr[j+1] + b[j+1,k] + time[k] + imsucc[k]/v;  { The last factor }
               { considers the effect due to the number of successors }
      if pr[k] > temp then temp := pr[k];
      if pr1[k] > temp4 then temp4 := pr1[k];
    end; {for j}
    pr[k] := temp; pr1[k] := temp4;
  end; {for i}
end; {level}

{ This procedure renumbers the tasks according to their priorities, with }
{ task one having the highest priority. }

procedure renumber;
var
  max : real;
  newno : array[1..110] of integer;
begin
  for k := 1 to v do begin
    max := pr[1]; j := 1;
    for i := 2 to v do begin
      if pr[i] > max then begin
        max := pr[i];
        j := i;
      end; {if}
    end; {for i}
    newtime[k] := time[j]; pr[j] := -1.0;
    newno[j] := k;
  end; {for k}

  { for i:=1 to v do writeln(filvar,'New number[',i,'] = ',newno[i]); }

  { Modifying the adjacency matrix according to the new numbers }
  for i := 1 to v do
    for j := 1 to v do newa[i,j] := false;
  for i := 1 to v do
    for j := 1 to v do if a[i,j] then newa[newno[i],newno[j]] := true;
end; {renumber}

{ This procedure updates the adjacency matrix when task z has finished }
{ execution. }

procedure update(z : integer);
var
  m, n : integer;
begin
  for m := 1 to v do
    if newa[z,m] then begin
      newa[z,m] := false;
      task[m].enabled := true;
      for n := 1 to v do if newa[n,m] then task[m].enabled := false;
    end; {then}
end; {update}

{main program}

begin
  assign(filvar1, 'a:input1.dat');
  assign(filvar, 'a:output30.dat');
  reset(filvar1);
  rewrite(filvar);
  writeln('Enter the order of the graph and the no. of processors');
  readln(v, p);
  initialize;
  level;
  renumber;
  for i := 1 to p do processor[i].busy := false;  { all processors are initialized }
  { Tasks are initialized }
  for i := 1 to v do begin
    task[i].assigned := false;
    task[i].enabled := false;
    task[i].executed := false;
    task[i].time := newtime[i];
  end; {for i}
  task[1].executed := true; task[v].executed := true;
  update(1);  { Task one is the entry node }
  t := 0;

  { Initial tasks are assigned }
  for j := 1 to p do
    for i := 1 to v do if (task[i].enabled and not(task[i].assigned) and
        not(processor[j].busy)) then begin
      task[i].resource := j;
      task[i].starttime := t;
      task[i].assigned := true;
      processor[j].busy := true;
    end; {then}
  t := 1;

  repeat
    for i := 1 to v do if (task[i].assigned and not(task[i].executed)) then begin
      task[i].time := task[i].time - 1;  { finished one unit of time }
      if task[i].time = 0 then begin
        task[i].executed := true;  { task i has finished execution }
        update(i);
        processor[task[i].resource].busy := false;  { processor becomes free }
      end; {time = 0}
    end; {for i}

    { If any processor is free then we check all the enabled but not assigned }
    { tasks which can now be assigned. }
    for j := 1 to p do if not(processor[j].busy) then
      for i := 1 to v do if (task[i].enabled and not(task[i].assigned) and
          not(processor[j].busy)) then begin
        task[i].resource := j;
        task[i].assigned := true;
        task[i].starttime := t;
        processor[j].busy := true;
      end; {for i then}

    over := true;
    for i := 1 to v do if not(task[i].executed) then over := false;
    t := t + 1;
  until (over);  { until all tasks have been executed }

  { Outputting data }
  writeln(filvar, 'Critical time for this graph is = ', pr1[1]);
  writeln(filvar, 'Total time taken = ', (t-1), ' units');
  writeln(filvar, 'Task Starttime Resource');
  for i := 2 to v-1 do begin
    write(filvar, ' ', i, ' ', task[i].starttime, ' ', task[i].resource);
    writeln(filvar);
  end; {for}

  { Calculating the performance indices }
  serialtime := 0;
  for i := 1 to v do serialtime := serialtime + time[i];
  speedup := serialtime/(t-1);
  eff := speedup/p;
  writeln(filvar, 'With ', p, ' processors Speedup = ', speedup, ' ; Efficiency = ', eff);
  close(filvar);
  close(filvar1);
end.